From d55b666834eeab75080713588a13de9342c26cb1 Mon Sep 17 00:00:00 2001
From: bosilca
Date: Mon, 26 Jun 2017 18:21:39 +0200
Subject: [PATCH] Topic/monitoring (#3109)

Add monitoring PML, OSC and IO components. They track all data exchanges
between processes, with the capability to include or exclude collective
traffic. The monitoring infrastructure is driven through MPI_T and can be
turned on and off at any time, on any communicator, file or window.
Documentation and examples have been added, as well as a shared library that
can be used with LD_PRELOAD and that allows the monitoring of any
application.

Signed-off-by: George Bosilca
Signed-off-by: Clement Foyer

* Add the ability to query pml monitoring results with the MPI Tools
  interface, using the performance variables "pml_monitoring_messages_count"
  and "pml_monitoring_messages_size" (a short MPI_T usage sketch is appended
  at the end of this message).
Signed-off-by: George Bosilca

* Fix a conversion problem and add a comment about the lack of component
  retain in the new component infrastructure.
Signed-off-by: George Bosilca

* Allow the pvar to be written by invoking the associated callback.
Signed-off-by: George Bosilca

* Various fixes for the monitoring:
  - Allocate all counting arrays in a single allocation.
  - Don't delay the initialization (do it at the first add_proc, as we know
    the number of processes in MPI_COMM_WORLD).
  - Add a choice: with or without MPI_T (default).
Signed-off-by: George Bosilca

* Cleanup for the monitoring module. Fixed a few bugs, and reshaped the
  operations to prepare for global or communicator-based monitoring. Start
  integrating support for MPI_T as well as MCA monitoring.
Signed-off-by: George Bosilca

* Add documentation about how to use the pml_monitoring component. The
  document presents its use with and without MPI_T. It may not reflect
  exactly how things work right now, but it should reflect how they will
  work in the end.
Signed-off-by: Clement Foyer

* Change the rank in MPI_COMM_WORLD and size(MPI_COMM_WORLD) to global
  variables in pml_monitoring.c. Change the mca_pml_monitoring_flush()
  signature so we don't need the size and rank parameters.
Signed-off-by: George Bosilca

* Improve monitoring support (including integration with MPI_T):
  - Use mca_pml_monitoring_enable to check the enable status.
  - Set mca_pml_monitoring_current_filename only if the parameter is set.
  - Allow 3 modes for pml_monitoring_enable_output: 1: stdout; 2: stderr;
    3: filename.
  - Fix test: 1 for differentiated messages, >1 for non-differentiated.
  - Fix output.
  - Add documentation for the pml_monitoring_enable_output parameter.
  - Remove a useless parameter in the example.
  - Set the filename only if using MPI tools.
  - Add missing parameters for fprintf in monitoring_flush (for output in
    the stdout/stderr cases).
  - Fix the expected output/results for the example header.
  - Fix the example when using MPI_Tools: a null pointer can't be passed
    directly; it needs to be a pointer to a null pointer.
  - Base whether to output or not on the message count, in order to print
    something even if only empty messages are exchanged.
  - Add a new example on how to access performance variables from within
    the code.
  - Allocate arrays according to the value returned by the binding.
Signed-off-by: Clement Foyer

* Add an overhead benchmark, with a script to use the data and create
  graphs out of the results.
Signed-off-by: Clement Foyer

* Fix a segfault at the end when the pml is not loaded.
Signed-off-by: Clement Foyer

* Start creating the common monitoring module. Factorize version numbering.
Signed-off-by: Clement Foyer

* Fix the microbenchmarks script.
Signed-off-by: Clement Foyer

* Improve readability of the code. NULL can't be passed as a PVAR parameter
  value; it must be a pointer to NULL or an empty string.
Signed-off-by: Clement Foyer

* Add the osc monitoring component.
Signed-off-by: Clement Foyer

* Add error checking if running out of memory in osc_monitoring.
Signed-off-by: Clement Foyer

* Resolve a brutal segfault when double-freeing the filename.
Signed-off-by: Clement Foyer

* Move the proper parts of the monitoring system to ompi/mca/common. Use
  common functions instead of the pml-specific ones; remove the pml ones.
Signed-off-by: Clement Foyer

* Add calls to record monitored data from osc. Use the common function to
  translate ranks.
Signed-off-by: Clement Foyer

* Fix the test_overhead benchmark script distribution.
Signed-off-by: Clement Foyer

* Fix linking the library with mca/common.
Signed-off-by: Clement Foyer

* Add passive operations in monitoring_test.
Signed-off-by: Clement Foyer

* Fix the "from" rank calculation. Add more detailed error messages.
Signed-off-by: Clement Foyer

* Fix alignments. Fix the common_monitoring_get_world_rank function. Remove
  useless trailing new lines.
Signed-off-by: Clement Foyer

* Fix the osc_monitoring mget_message_count function call.
Signed-off-by: Clement Foyer

* Change the common_monitoring function names to respect the naming
  convention. Move the common parts of finalization to common_finalize. Add
  some comments.
Signed-off-by: Clement Foyer

* Add the monitoring common output system.
Signed-off-by: Clement Foyer

* Add an error message when trying to flush to a file and the open fails.
  Remove the erroneous info message when flushing while the monitoring is
  already disabled.
Signed-off-by: Clement Foyer

* Consistent output file name (with and without MPI_T).
Signed-off-by: Clement Foyer

* Always output to a file when flushing at pvar_stop(flush).
Signed-off-by: Clement Foyer

* Update the monitoring documentation. Complete the information from the
  HowTo. Fix a few mistakes and typos.
Signed-off-by: Clement Foyer

* Use the world_rank for printf's. Fix name generation for output files when
  using MPI_T. Minor changes in the benchmark starting script.
Signed-off-by: Clement Foyer

* Clean potential previous runs, but keep the results at the end in order to
  potentially reprocess the data. Add comments.
Signed-off-by: Clement Foyer

* Add a security check for unique initialization of osc monitoring.
Signed-off-by: Clement Foyer

* Reduce the number of symbols available outside mca/common/monitoring.
Signed-off-by: Clement Foyer

* Remove the use of __sync_* built-ins. Use opal_atomic_* instead.
Signed-off-by: Clement Foyer

* Allocate the hashtable on common/monitoring component initialization.
  Define symbols to set the values for error/warning/info verbose output.
  Use opal_atomic instead of built-in functions in the osc/monitoring
  template initialization.
Signed-off-by: Clement Foyer

* Delete a now useless file: moved to common/monitoring.
Signed-off-by: Clement Foyer

* Add a histogram distribution of message sizes.
Signed-off-by: Clement Foyer

* Add a histogram array of base-2 logs of message sizes. Use a simple call
  to reset/allocate arrays in common_monitoring.c.
Signed-off-by: Clement Foyer

* Add information to the dump file. Separate the monitored data per category
  (pt2pt/osc/coll (to come)).
Signed-off-by: Clement Foyer

* Add a coll component for collective communications monitoring.
Signed-off-by: Clement Foyer

* Fix warning messages: use c_name, as the magic id is not always defined;
  moreover, a % was missing. Add a call to release the underlying modules.
  Add debug info messages. Add a warning which may lead to further analysis.
Signed-off-by: Clement Foyer

* Fix the log10_2 constant initialization. Fix the index calculation for the
  histogram array.
Signed-off-by: Clement Foyer

* Add debug info messages to more easily follow the initialization steps.
Signed-off-by: Clement Foyer

* Group all the var/pvar definitions in common_monitoring. Separate the
  initial filename from the current one, to ease its lifetime management.
  Add verifications to ensure common is initialized only once. Move state
  variable management to common_monitoring. monitoring_filter only indicates
  whether filtering is activated. Fix an out-of-range access in the
  histogram. The list is not used with the struct
  mca_monitoring_coll_data_t, so inherit only from opal_object_t. Remove
  useless dead code.
Signed-off-by: Clement Foyer

* Fix an invalid memory allocation. Initialize initial_filename to an empty
  string to avoid an invalid read in mca_base_var_register.
Signed-off-by: Clement Foyer

* Don't install the test scripts.
Signed-off-by: George Bosilca
Signed-off-by: Clement Foyer

* Fix missing procs in the hashtable. Cache coll monitoring data.
  - Add the MCA_PML_BASE_FLAG_REQUIRE_WORLD flag to the PML layer.
  - Cache monitoring data relative to collective operations on creation.
  - Remove double caching.
  - Use the same proc name definition for the hash table when inserting and
    when retrieving.
Signed-off-by: Clement Foyer

* Use an intermediate variable to avoid an invalid write while retrieving
  ranks in the hashtable.
Signed-off-by: Clement Foyer

* Add the missing release of the last element in flush_all. Add the release
  of the hashtable in finalize.
Signed-off-by: Clement Foyer

* Use a linked list instead of a hashtable to keep track of communicator
  data. Add the release of the structure at finalize time.
Signed-off-by: Clement Foyer

* Set world_rank from the hashtable only if found.
Signed-off-by: Clement Foyer

* Use the predefined symbol from the opal system to print int.
Signed-off-by: Clement Foyer

* Move the collective monitoring data to a hashtable. Add a pvar to access
  the monitoring_coll_data. Move the function headers to a private file to
  be used only in ompi/mca/common/monitoring.
Signed-off-by: Clement Foyer

* Fix the pvar registration. Use OMPI_ERROR instead of -1 as the returned
  error value. Fix the releasing of coll_data_t objects. Assign the value
  only if data is found in the hashtable.
Signed-off-by: Clement Foyer

* Add an automated check (with MPI_Tools) of monitoring.
Signed-off-by: Clement Foyer

* Fix the procs list caching in common_monitoring_coll_data_t.
  - Fix the monitoring_coll_data type definition.
  - Use size(COMM_WORLD)-1 to determine the maximum number of digits.
Signed-off-by: Clement Foyer

* Add linking to Fortran applications for LD_PRELOAD usage of
  monitoring_prof.
Signed-off-by: Clement Foyer

* Add PVAR handles. Clean up the code (visibility, add comments...). Start
  updating the documentation.
Signed-off-by: Clement Foyer

* Fix coll operations monitoring. Update check_monitoring according to the
  added pvar. Fix the monitoring array allocation.
Signed-off-by: Clement Foyer

* Documentation update. Update and then move the LaTeX and README
  documentation to a more logical place.
Signed-off-by: Clement Foyer

* Aggregate monitoring COLL data into the generated matrix. Update the
  documentation accordingly.
Signed-off-by: Clement Foyer

* Fix monitoring_prof (bad variable.vector used, and wrong array in
  PMPI_Gather).
Signed-off-by: Clement Foyer

* Add reduce_scatter and reduce_scatter_block monitoring. Reduce the memory
  footprint of monitoring_prof. Unify the OSC-related outputs.
Signed-off-by: Clement Foyer

* Add the use of a machine file for the overhead benchmark.
Signed-off-by: Clement Foyer

* Check for out-of-bound writes in the histogram.
Signed-off-by: Clement Foyer

* Fix the common_monitoring_cache object init for MPI_COMM_WORLD.
Signed-off-by: Clement Foyer

* Add RDMA benchmarks to test_overhead:
  - Add error file output.
  - Add MPI_Put and MPI_Get results analysis.
  - Add overhead computation for complete sending (pingpong / 2).
Signed-off-by: Clement Foyer

* Add computation of the average and median of overheads. Add comments and
  copyrights to the test_overhead script.
Signed-off-by: Clement Foyer

* Add technical documentation.
Signed-off-by: Clement Foyer

* Adapt to the new definition of communicators.
Signed-off-by: Clement Foyer

* Update the expected output in test/monitoring/monitoring_test.c.
Signed-off-by: Clement Foyer

* Add dumping of the histogram in an edge case.
Signed-off-by: Clement Foyer

* Add a reduce(pml_monitoring_messages_count, MPI_MAX) example.
Signed-off-by: Clement Foyer

* Add consistency in header inclusion. Include
  ompi/mpi/fortran/mpif-h/bindings.h only if needed. Add a sanity check
  before emptying the hashtable. Fix typos in the documentation.
Signed-off-by: Clement Foyer

* Misc monitoring fixes:
  - test/monitoring: fix the test when weak symbols are not available
  - monitoring: fix a typo, add a missing file in Makefile.am, and have
    monitoring_common.h and monitoring_common_coll.h included in the distro
  - test/monitoring: clean up all tests and make distclean a happy panda
  - test/monitoring: use gettimeofday() if clock_gettime() is unavailable
  - monitoring: silence misc warnings (#3)
Signed-off-by: Gilles Gouaillardet

* Cleanups.
Signed-off-by: George Bosilca

* Change int64_t to size_t. Keep size_t used across all monitoring
  components. Adapt the documentation. Remove the useless MPI_Request and
  MPI_Status from monitoring_test.c.
Signed-off-by: Clement Foyer

* Add a parameter for the RMA test case.
Signed-off-by: Clement Foyer

* Clean the maximum bound computation for the proc list dump. Use ptrdiff_t
  instead of OPAL_PTRDIFF_TYPE to reflect the changes from commit
  fa5cd0dbe5d261bd9d2cc61d5b305b4ef6a2dda6.
Signed-off-by: Clement Foyer

* Add communicator-specific monitored collective data reset.
Signed-off-by: Clement Foyer

* Add the monitoring scripts to 'make dist'. Also install them in the build
  and the install directories.
Signed-off-by: George Bosilca
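
For reference, the sketch below shows how the monitoring pvars named above
can be read through the standard MPI_T interface. It is an illustration
only, not part of the patch: the pvar class (MPI_T_PVAR_CLASS_SIZE), the
per-peer size_t layout of the result array, and the binding of the handle to
MPI_COMM_WORLD are assumptions about how the monitoring component registers
its variables, and error handling is kept to a minimum. The test programs
added by this patch (test/monitoring/monitoring_test.c,
test/monitoring/test_pvar_access.c) are the authoritative examples.

    /*
     * Illustrative sketch only: read "pml_monitoring_messages_count"
     * through MPI_T. The pvar class, the size_t element type and the
     * MPI_COMM_WORLD binding are assumptions.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int provided, idx, count;
        MPI_T_pvar_session session;
        MPI_T_pvar_handle handle;
        MPI_Comm comm = MPI_COMM_WORLD;
        size_t *counts;   /* one entry per peer in the bound communicator */

        MPI_Init(&argc, &argv);
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        /* Look the pvar up by name; this fails if the monitoring PML is
         * not loaded. */
        if (MPI_SUCCESS != MPI_T_pvar_get_index("pml_monitoring_messages_count",
                                                MPI_T_PVAR_CLASS_SIZE, &idx)) {
            fprintf(stderr, "pml_monitoring pvar not found\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_T_pvar_session_create(&session);
        /* Bind a handle to MPI_COMM_WORLD; count returns the array length. */
        MPI_T_pvar_handle_alloc(session, idx, &comm, &handle, &count);
        counts = (size_t *)calloc(count, sizeof(size_t));

        MPI_T_pvar_start(session, handle);
        /* ... application communication to be monitored goes here ... */
        MPI_T_pvar_read(session, handle, counts);
        MPI_T_pvar_stop(session, handle);

        printf("messages sent to rank 0: %zu\n", counts[0]);

        free(counts);
        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }

The same pattern applies to "pml_monitoring_messages_size".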
---
 configure.ac | 4 +
 ompi/mca/coll/base/coll_base_find_available.c | 49 +-
 ompi/mca/coll/monitoring/Makefile.am | 53 +
 ompi/mca/coll/monitoring/coll_monitoring.h | 385 +++++
 .../monitoring/coll_monitoring_allgather.c | 71 +
 .../monitoring/coll_monitoring_allgatherv.c | 71 +
 .../monitoring/coll_monitoring_allreduce.c | 70 +
 .../monitoring/coll_monitoring_alltoall.c | 69 +
 .../monitoring/coll_monitoring_alltoallv.c | 75 +
 .../monitoring/coll_monitoring_alltoallw.c | 77 +
 .../coll/monitoring/coll_monitoring_barrier.c | 56 +
 .../coll/monitoring/coll_monitoring_bcast.c | 73 +
 .../monitoring/coll_monitoring_component.c | 255 ++++
 .../coll/monitoring/coll_monitoring_exscan.c | 68 +
 .../coll/monitoring/coll_monitoring_gather.c | 71 +
 .../coll/monitoring/coll_monitoring_gatherv.c | 77 +
 .../coll_monitoring_neighbor_allgather.c | 120 ++
 .../coll_monitoring_neighbor_allgatherv.c | 124 ++
 .../coll_monitoring_neighbor_alltoall.c | 122 ++
 .../coll_monitoring_neighbor_alltoallv.c | 130 ++
 .../coll_monitoring_neighbor_alltoallw.c | 132 ++
 .../coll/monitoring/coll_monitoring_reduce.c | 74 +
 .../coll_monitoring_reduce_scatter.c | 74 +
 .../coll_monitoring_reduce_scatter_block.c | 72 +
 .../coll/monitoring/coll_monitoring_scan.c | 68 +
 .../coll/monitoring/coll_monitoring_scatter.c | 78 +
 .../monitoring/coll_monitoring_scatterv.c | 73 +
 .../monitoring/HowTo_pml_monitoring.tex | 1298 +++++++++++++++++
 ompi/mca/common/monitoring/Makefile.am | 50 +
 ompi/mca/{pml => common}/monitoring/README | 0
 .../mca/common/monitoring/common_monitoring.c | 795 ++++++++++
 .../mca/common/monitoring/common_monitoring.h | 120 ++
 .../monitoring/common_monitoring_coll.c | 380 +++++
 .../monitoring/common_monitoring_coll.h | 59 +
 ompi/mca/osc/monitoring/Makefile.am | 38 +
 ompi/mca/osc/monitoring/configure.m4 | 19 +
 ompi/mca/osc/monitoring/osc_monitoring.h | 29 +
 .../monitoring/osc_monitoring_accumulate.h | 175 +++
 .../monitoring/osc_monitoring_active_target.h | 48 +
 ompi/mca/osc/monitoring/osc_monitoring_comm.h | 118 ++
 .../osc/monitoring/osc_monitoring_component.c | 154 ++
 .../osc/monitoring/osc_monitoring_dynamic.h | 27 +
 .../osc/monitoring/osc_monitoring_module.h | 89 ++
 .../osc_monitoring_passive_target.h | 63 +
 .../osc/monitoring/osc_monitoring_template.h | 79 +
 ompi/mca/pml/monitoring/Makefile.am | 3 +-
 ompi/mca/pml/monitoring/pml_monitoring.c | 258 ----
 ompi/mca/pml/monitoring/pml_monitoring.h | 27 +-
 ompi/mca/pml/monitoring/pml_monitoring_comm.c | 4 +-
 .../pml/monitoring/pml_monitoring_component.c | 197 +--
 .../pml/monitoring/pml_monitoring_iprobe.c | 4 +-
 .../mca/pml/monitoring/pml_monitoring_irecv.c | 4 +-
 .../mca/pml/monitoring/pml_monitoring_isend.c | 24 +-
 .../mca/pml/monitoring/pml_monitoring_start.c | 17 +-
 opal/mca/base/mca_base_pvar.c | 2 +
 test/monitoring/Makefile.am | 39 +-
 test/monitoring/aggregate_profile.pl | 4 +-
 test/monitoring/check_monitoring.c | 516 +++++++
 test/monitoring/example_reduce_count.c | 127 ++
 test/monitoring/monitoring_prof.c | 268 +++-
 test/monitoring/monitoring_test.c | 433 ++++--
 test/monitoring/profile2mat.pl | 8 +-
 test/monitoring/test_overhead.c | 294 ++++
 test/monitoring/test_overhead.sh | 216 +++
 test/monitoring/test_pvar_access.c | 323 ++++
 65 files changed, 8216 insertions(+), 684 deletions(-)
 create mode 100644 ompi/mca/coll/monitoring/Makefile.am
 create mode 100644 ompi/mca/coll/monitoring/coll_monitoring.h
 create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_allgather.c
 create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_allgatherv.c
create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_allreduce.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_alltoall.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_alltoallv.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_alltoallw.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_barrier.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_bcast.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_component.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_exscan.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_gather.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_gatherv.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgather.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgatherv.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoall.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallv.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallw.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_reduce.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter_block.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_scan.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_scatter.c create mode 100644 ompi/mca/coll/monitoring/coll_monitoring_scatterv.c create mode 100644 ompi/mca/common/monitoring/HowTo_pml_monitoring.tex create mode 100644 ompi/mca/common/monitoring/Makefile.am rename ompi/mca/{pml => common}/monitoring/README (100%) create mode 100644 ompi/mca/common/monitoring/common_monitoring.c create mode 100644 ompi/mca/common/monitoring/common_monitoring.h create mode 100644 ompi/mca/common/monitoring/common_monitoring_coll.c create mode 100644 ompi/mca/common/monitoring/common_monitoring_coll.h create mode 100644 ompi/mca/osc/monitoring/Makefile.am create mode 100644 ompi/mca/osc/monitoring/configure.m4 create mode 100644 ompi/mca/osc/monitoring/osc_monitoring.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_accumulate.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_active_target.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_comm.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_component.c create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_dynamic.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_module.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_passive_target.h create mode 100644 ompi/mca/osc/monitoring/osc_monitoring_template.h delete mode 100644 ompi/mca/pml/monitoring/pml_monitoring.c create mode 100644 test/monitoring/check_monitoring.c create mode 100644 test/monitoring/example_reduce_count.c create mode 100644 test/monitoring/test_overhead.c create mode 100755 test/monitoring/test_overhead.sh create mode 100644 test/monitoring/test_pvar_access.c diff --git a/configure.ac b/configure.ac index 764c72276c5..deb5a68031c 100644 --- a/configure.ac +++ b/configure.ac @@ -1409,6 +1409,10 @@ AC_CONFIG_FILES([ test/util/Makefile ]) m4_ifdef([project_ompi], [AC_CONFIG_FILES([test/monitoring/Makefile])]) +m4_ifdef([project_ompi], [ + m4_ifdef([MCA_BUILD_ompi_pml_monitoring_DSO_TRUE], + [AC_CONFIG_LINKS(test/monitoring/profile2mat.pl:test/monitoring/profile2mat.pl + 
test/monitoring/aggregate_profile.pl:test/monitoring/aggregate_profile.pl)])]) AC_CONFIG_FILES([contrib/dist/mofed/debian/rules], [chmod +x contrib/dist/mofed/debian/rules]) diff --git a/ompi/mca/coll/base/coll_base_find_available.c b/ompi/mca/coll/base/coll_base_find_available.c index e1f69d4ba47..b2e25944f3f 100644 --- a/ompi/mca/coll/base/coll_base_find_available.c +++ b/ompi/mca/coll/base/coll_base_find_available.c @@ -2,7 +2,7 @@ * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. - * Copyright (c) 2004-2005 The University of Tennessee and The University + * Copyright (c) 2004-2017 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, @@ -46,9 +46,6 @@ static int init_query(const mca_base_component_t * ls, bool enable_progress_threads, bool enable_mpi_threads); -static int init_query_2_0_0(const mca_base_component_t * ls, - bool enable_progress_threads, - bool enable_mpi_threads); /* * Scan down the list of successfully opened components and query each of @@ -105,6 +102,20 @@ int mca_coll_base_find_available(bool enable_progress_threads, } +/* + * Query a specific component, coll v2.0.0 + */ +static inline int +init_query_2_0_0(const mca_base_component_t * component, + bool enable_progress_threads, + bool enable_mpi_threads) +{ + mca_coll_base_component_2_0_0_t *coll = + (mca_coll_base_component_2_0_0_t *) component; + + return coll->collm_init_query(enable_progress_threads, + enable_mpi_threads); +} /* * Query a component, see if it wants to run at all. If it does, save * some information. If it doesn't, close it. @@ -138,33 +149,11 @@ static int init_query(const mca_base_component_t * component, } /* Query done -- look at the return value to see what happened */ - - if (OMPI_SUCCESS != ret) { - opal_output_verbose(10, ompi_coll_base_framework.framework_output, - "coll:find_available: coll component %s is not available", - component->mca_component_name); - } else { - opal_output_verbose(10, ompi_coll_base_framework.framework_output, - "coll:find_available: coll component %s is available", - component->mca_component_name); - } - - /* All done */ + opal_output_verbose(10, ompi_coll_base_framework.framework_output, + "coll:find_available: coll component %s is %savailable", + component->mca_component_name, + (OMPI_SUCCESS == ret) ? "": "not "); return ret; } - -/* - * Query a specific component, coll v2.0.0 - */ -static int init_query_2_0_0(const mca_base_component_t * component, - bool enable_progress_threads, - bool enable_mpi_threads) -{ - mca_coll_base_component_2_0_0_t *coll = - (mca_coll_base_component_2_0_0_t *) component; - - return coll->collm_init_query(enable_progress_threads, - enable_mpi_threads); -} diff --git a/ompi/mca/coll/monitoring/Makefile.am b/ompi/mca/coll/monitoring/Makefile.am new file mode 100644 index 00000000000..10893b0075e --- /dev/null +++ b/ompi/mca/coll/monitoring/Makefile.am @@ -0,0 +1,53 @@ +# +# Copyright (c) 2016 Inria. All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +monitoring_sources = \ + coll_monitoring.h \ + coll_monitoring_allgather.c \ + coll_monitoring_allgatherv.c \ + coll_monitoring_allreduce.c \ + coll_monitoring_alltoall.c \ + coll_monitoring_alltoallv.c \ + coll_monitoring_alltoallw.c \ + coll_monitoring_barrier.c \ + coll_monitoring_bcast.c \ + coll_monitoring_component.c \ + coll_monitoring_exscan.c \ + coll_monitoring_gather.c \ + coll_monitoring_gatherv.c \ + coll_monitoring_neighbor_allgather.c \ + coll_monitoring_neighbor_allgatherv.c \ + coll_monitoring_neighbor_alltoall.c \ + coll_monitoring_neighbor_alltoallv.c \ + coll_monitoring_neighbor_alltoallw.c \ + coll_monitoring_reduce.c \ + coll_monitoring_reduce_scatter.c \ + coll_monitoring_reduce_scatter_block.c \ + coll_monitoring_scan.c \ + coll_monitoring_scatter.c \ + coll_monitoring_scatterv.c + +if MCA_BUILD_ompi_coll_monitoring_DSO +component_noinst = +component_install = mca_coll_monitoring.la +else +component_noinst = libmca_coll_monitoring.la +component_install = +endif + +mcacomponentdir = $(ompilibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_coll_monitoring_la_SOURCES = $(monitoring_sources) +mca_coll_monitoring_la_LDFLAGS = -module -avoid-version +mca_coll_monitoring_la_LIBADD = \ + $(OMPI_TOP_BUILDDIR)/ompi/mca/common/monitoring/libmca_common_monitoring.la + +noinst_LTLIBRARIES = $(component_noinst) +libmca_coll_monitoring_la_SOURCES = $(monitoring_sources) +libmca_coll_monitoring_la_LDFLAGS = -module -avoid-version diff --git a/ompi/mca/coll/monitoring/coll_monitoring.h b/ompi/mca/coll/monitoring/coll_monitoring.h new file mode 100644 index 00000000000..1cd001d8a74 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring.h @@ -0,0 +1,385 @@ +/* + * Copyright (c) 2016 Inria. All rights reserved. + * Copyright (c) 2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_COLL_MONITORING_H +#define MCA_COLL_MONITORING_H + +BEGIN_C_DECLS + +#include +#include +#include +#include +#include +#include +#include + +struct mca_coll_monitoring_component_t { + mca_coll_base_component_t super; + int priority; +}; +typedef struct mca_coll_monitoring_component_t mca_coll_monitoring_component_t; + +OMPI_DECLSPEC extern mca_coll_monitoring_component_t mca_coll_monitoring_component; + +struct mca_coll_monitoring_module_t { + mca_coll_base_module_t super; + mca_coll_base_comm_coll_t real; + mca_monitoring_coll_data_t*data; + int64_t is_initialized; +}; +typedef struct mca_coll_monitoring_module_t mca_coll_monitoring_module_t; +OMPI_DECLSPEC OBJ_CLASS_DECLARATION(mca_coll_monitoring_module_t); + +/* + * Coll interface functions + */ + +/* Blocking */ +extern int mca_coll_monitoring_allgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_allgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_allreduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_alltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_alltoallv(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_alltoallw(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_barrier(struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_bcast(void *buff, int count, + struct ompi_datatype_t *datatype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_exscan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_gather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, struct ompi_datatype_t *rdtype, + int root, struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_gatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_reduce(const void *sbuf, void *rbuf, int count, + struct 
ompi_datatype_t *dtype, + struct ompi_op_t *op, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_reduce_scatter(const void *sbuf, void *rbuf, + const int *rcounts, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_reduce_scatter_block(const void *sbuf, void *rbuf, + int rcount, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_scan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_scatter(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_scatterv(const void *sbuf, const int *scounts, const int *disps, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +/* Nonblocking */ +extern int mca_coll_monitoring_iallgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_iallgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_iallreduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ialltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ialltoallv(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ialltoallw(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ibarrier(struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ibcast(void *buff, int count, + struct ompi_datatype_t *datatype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_iexscan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t 
*dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_igather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, struct ompi_datatype_t *rdtype, + int root, struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_igatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ireduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ireduce_scatter(const void *sbuf, void *rbuf, + const int *rcounts, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ireduce_scatter_block(const void *sbuf, void *rbuf, + int rcount, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_iscan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_iscatter(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_iscatterv(const void *sbuf, const int *scounts, const int *disps, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +/* Neighbor */ +extern int mca_coll_monitoring_neighbor_allgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, void *rbuf, + int rcount, struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_neighbor_allgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, void * rbuf, + const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_neighbor_alltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_neighbor_alltoallv(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_neighbor_alltoallw(const void *sbuf, const int *scounts, + const MPI_Aint *sdisps, + struct ompi_datatype_t * const *sdtypes, + 
void *rbuf, const int *rcounts, + const MPI_Aint *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ineighbor_allgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, void *rbuf, + int rcount, struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ineighbor_allgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void * rbuf, const int *rcounts, + const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ineighbor_alltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, void *rbuf, + int rcount, struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ineighbor_alltoallv(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +extern int mca_coll_monitoring_ineighbor_alltoallw(const void *sbuf, const int *scounts, + const MPI_Aint *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const MPI_Aint *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module); + +END_C_DECLS + +#endif /* MCA_COLL_MONITORING_H */ diff --git a/ompi/mca/coll/monitoring/coll_monitoring_allgather.c b/ompi/mca/coll/monitoring/coll_monitoring_allgather.c new file mode 100644 index 00000000000..5b9b5d26a2e --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_allgather.c @@ -0,0 +1,71 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_allgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( i == my_rank ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_allgather(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, monitoring_module->real.coll_allgather_module); +} + +int mca_coll_monitoring_iallgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_iallgather(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, request, monitoring_module->real.coll_iallgather_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_allgatherv.c b/ompi/mca/coll/monitoring/coll_monitoring_allgatherv.c new file mode 100644 index 00000000000..2bc7985009b --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_allgatherv.c @@ -0,0 +1,71 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_allgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void * rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_allgatherv(sbuf, scount, sdtype, rbuf, rcounts, disps, rdtype, comm, monitoring_module->real.coll_allgatherv_module); +} + +int mca_coll_monitoring_iallgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void * rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_iallgatherv(sbuf, scount, sdtype, rbuf, rcounts, disps, rdtype, comm, request, monitoring_module->real.coll_iallgatherv_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_allreduce.c b/ompi/mca/coll/monitoring/coll_monitoring_allreduce.c new file mode 100644 index 00000000000..95905070006 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_allreduce.c @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_allreduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_allreduce(sbuf, rbuf, count, dtype, op, comm, monitoring_module->real.coll_allreduce_module); +} + +int mca_coll_monitoring_iallreduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_iallreduce(sbuf, rbuf, count, dtype, op, comm, request, monitoring_module->real.coll_iallreduce_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_alltoall.c b/ompi/mca/coll/monitoring/coll_monitoring_alltoall.c new file mode 100644 index 00000000000..33dfbaed01f --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_alltoall.c @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_alltoall(const void *sbuf, int scount, struct ompi_datatype_t *sdtype, + void* rbuf, int rcount, struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_alltoall(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, monitoring_module->real.coll_alltoall_module); +} + +int mca_coll_monitoring_ialltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_ialltoall(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, request, monitoring_module->real.coll_ialltoall_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_alltoallv.c b/ompi/mca/coll/monitoring/coll_monitoring_alltoallv.c new file mode 100644 index 00000000000..acdd0d4b5f9 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_alltoallv.c @@ -0,0 +1,75 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_alltoallv(const void *sbuf, const int *scounts, const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + return monitoring_module->real.coll_alltoallv(sbuf, scounts, sdisps, sdtype, rbuf, rcounts, rdisps, rdtype, comm, monitoring_module->real.coll_alltoallv_module); +} + +int mca_coll_monitoring_ialltoallv(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + return monitoring_module->real.coll_ialltoallv(sbuf, scounts, sdisps, sdtype, rbuf, rcounts, rdisps, rdtype, comm, request, monitoring_module->real.coll_ialltoallv_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_alltoallw.c b/ompi/mca/coll/monitoring/coll_monitoring_alltoallw.c new file mode 100644 index 00000000000..d573e970506 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_alltoallw.c @@ -0,0 +1,77 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_alltoallw(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + ompi_datatype_type_size(sdtypes[i], &type_size); + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + return monitoring_module->real.coll_alltoallw(sbuf, scounts, sdisps, sdtypes, rbuf, rcounts, rdisps, rdtypes, comm, monitoring_module->real.coll_alltoallw_module); +} + +int mca_coll_monitoring_ialltoallw(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + ompi_datatype_type_size(sdtypes[i], &type_size); + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + return monitoring_module->real.coll_ialltoallw(sbuf, scounts, sdisps, sdtypes, rbuf, rcounts, rdisps, rdtypes, comm, request, monitoring_module->real.coll_ialltoallw_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_barrier.c b/ompi/mca/coll/monitoring/coll_monitoring_barrier.c new file mode 100644 index 00000000000..7e8af198893 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_barrier.c @@ -0,0 +1,56 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_barrier(struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + int i, rank; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, 0); + } + } + mca_common_monitoring_coll_a2a(0, monitoring_module->data); + return monitoring_module->real.coll_barrier(comm, monitoring_module->real.coll_barrier_module); +} + +int mca_coll_monitoring_ibarrier(struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + int i, rank; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, 0); + } + } + mca_common_monitoring_coll_a2a(0, monitoring_module->data); + return monitoring_module->real.coll_ibarrier(comm, request, monitoring_module->real.coll_ibarrier_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_bcast.c b/ompi/mca/coll/monitoring/coll_monitoring_bcast.c new file mode 100644 index 00000000000..0fc1488dae8 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_bcast.c @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * Copyright (c) 2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_bcast(void *buff, int count, + struct ompi_datatype_t *datatype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(datatype, &type_size); + data_size = count * type_size; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + mca_common_monitoring_coll_o2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( i == root ) continue; /* No self sending */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + } + return monitoring_module->real.coll_bcast(buff, count, datatype, root, comm, monitoring_module->real.coll_bcast_module); +} + +int mca_coll_monitoring_ibcast(void *buff, int count, + struct ompi_datatype_t *datatype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(datatype, &type_size); + data_size = count * type_size; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + mca_common_monitoring_coll_o2a(data_size * (comm_size - 1), monitoring_module->data); + for( i = 0; i < comm_size; ++i ) { + if( i == root ) continue; /* No self sending */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + } + return monitoring_module->real.coll_ibcast(buff, count, datatype, root, comm, request, monitoring_module->real.coll_ibcast_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_component.c b/ompi/mca/coll/monitoring/coll_monitoring_component.c new file mode 100644 index 00000000000..2e61a1c87e0 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_component.c @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include "coll_monitoring.h" +#include +#include +#include +#include + +#define MONITORING_SAVE_PREV_COLL_API(__module, __comm, __api) \ + do { \ + if( NULL != __comm->c_coll->coll_ ## __api ## _module ) { \ + __module->real.coll_ ## __api = __comm->c_coll->coll_ ## __api; \ + __module->real.coll_ ## __api ## _module = __comm->c_coll->coll_ ## __api ## _module; \ + OBJ_RETAIN(__module->real.coll_ ## __api ## _module); \ + } else { \ + /* If no function previously provided, do not monitor */ \ + __module->super.coll_ ## __api = NULL; \ + OPAL_MONITORING_PRINT_WARN("COMM \"%s\": No monitoring available for " \ + "coll_" # __api, __comm->c_name); \ + } \ + if( NULL != __comm->c_coll->coll_i ## __api ## _module ) { \ + __module->real.coll_i ## __api = __comm->c_coll->coll_i ## __api; \ + __module->real.coll_i ## __api ## _module = __comm->c_coll->coll_i ## __api ## _module; \ + OBJ_RETAIN(__module->real.coll_i ## __api ## _module); \ + } else { \ + /* If no function previously provided, do not monitor */ \ + __module->super.coll_i ## __api = NULL; \ + OPAL_MONITORING_PRINT_WARN("COMM \"%s\": No monitoring available for " \ + "coll_i" # __api, __comm->c_name); \ + } \ + } while(0) + +#define MONITORING_RELEASE_PREV_COLL_API(__module, __comm, __api) \ + do { \ + if( NULL != __module->real.coll_ ## __api ## _module ) { \ + if( NULL != __module->real.coll_ ## __api ## _module->coll_module_disable ) { \ + __module->real.coll_ ## __api ## _module->coll_module_disable(__module->real.coll_ ## __api ## _module, __comm); \ + } \ + OBJ_RELEASE(__module->real.coll_ ## __api ## _module); \ + __module->real.coll_ ## __api = NULL; \ + __module->real.coll_ ## __api ## _module = NULL; \ + } \ + if( NULL != __module->real.coll_i ## __api ## _module ) { \ + if( NULL != __module->real.coll_i ## __api ## _module->coll_module_disable ) { \ + __module->real.coll_i ## __api ## _module->coll_module_disable(__module->real.coll_i ## __api ## _module, __comm); \ + } \ + OBJ_RELEASE(__module->real.coll_i ## __api ## _module); \ + __module->real.coll_i ## __api = NULL; \ + __module->real.coll_i ## __api ## _module = NULL; \ + } \ + } while(0) + +#define MONITORING_SET_FULL_PREV_COLL_API(m, c, operation) \ + do { \ + operation(m, c, allgather); \ + operation(m, c, allgatherv); \ + operation(m, c, allreduce); \ + operation(m, c, alltoall); \ + operation(m, c, alltoallv); \ + operation(m, c, alltoallw); \ + operation(m, c, barrier); \ + operation(m, c, bcast); \ + operation(m, c, exscan); \ + operation(m, c, gather); \ + operation(m, c, gatherv); \ + operation(m, c, reduce); \ + operation(m, c, reduce_scatter); \ + operation(m, c, reduce_scatter_block); \ + operation(m, c, scan); \ + operation(m, c, scatter); \ + operation(m, c, scatterv); \ + operation(m, c, neighbor_allgather); \ + operation(m, c, neighbor_allgatherv); \ + operation(m, c, neighbor_alltoall); \ + operation(m, c, neighbor_alltoallv); \ + operation(m, c, neighbor_alltoallw); \ + } while(0) + +#define MONITORING_SAVE_FULL_PREV_COLL_API(m, c) \ + MONITORING_SET_FULL_PREV_COLL_API((m), (c), MONITORING_SAVE_PREV_COLL_API) + +#define MONITORING_RELEASE_FULL_PREV_COLL_API(m, c) \ + MONITORING_SET_FULL_PREV_COLL_API((m), (c), MONITORING_RELEASE_PREV_COLL_API) + +static int mca_coll_monitoring_component_open(void) +{ + return OMPI_SUCCESS; +} + +static int mca_coll_monitoring_component_close(void) +{ + OPAL_MONITORING_PRINT_INFO("coll_module_close"); + 
mca_common_monitoring_finalize(); + return OMPI_SUCCESS; +} + +static int mca_coll_monitoring_component_init(bool enable_progress_threads, + bool enable_mpi_threads) +{ + OPAL_MONITORING_PRINT_INFO("coll_module_init"); + mca_common_monitoring_init(); + return OMPI_SUCCESS; +} + +static int mca_coll_monitoring_component_register(void) +{ + return OMPI_SUCCESS; +} + +static int +mca_coll_monitoring_module_enable(mca_coll_base_module_t*module, struct ompi_communicator_t*comm) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( 1 == opal_atomic_add_64(&monitoring_module->is_initialized, 1) ) { + MONITORING_SAVE_FULL_PREV_COLL_API(monitoring_module, comm); + monitoring_module->data = mca_common_monitoring_coll_new(comm); + OPAL_MONITORING_PRINT_INFO("coll_module_enabled"); + } + return OMPI_SUCCESS; +} + +static int +mca_coll_monitoring_module_disable(mca_coll_base_module_t*module, struct ompi_communicator_t*comm) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( 0 == opal_atomic_sub_64(&monitoring_module->is_initialized, 1) ) { + MONITORING_RELEASE_FULL_PREV_COLL_API(monitoring_module, comm); + mca_common_monitoring_coll_release(monitoring_module->data); + monitoring_module->data = NULL; + OPAL_MONITORING_PRINT_INFO("coll_module_disabled"); + } + return OMPI_SUCCESS; +} + +static int mca_coll_monitoring_ft_event(int state) +{ + switch(state) { + case OPAL_CRS_CHECKPOINT: + case OPAL_CRS_CONTINUE: + case OPAL_CRS_RESTART: + case OPAL_CRS_TERM: + default: + ; + } + return OMPI_SUCCESS; +} + +static mca_coll_base_module_t* +mca_coll_monitoring_component_query(struct ompi_communicator_t*comm, int*priority) +{ + OPAL_MONITORING_PRINT_INFO("coll_module_query"); + mca_coll_monitoring_module_t*monitoring_module = OBJ_NEW(mca_coll_monitoring_module_t); + if( NULL == monitoring_module ) return (*priority = -1, NULL); + + /* Initialize module functions */ + monitoring_module->super.coll_module_enable = mca_coll_monitoring_module_enable; + monitoring_module->super.coll_module_disable = mca_coll_monitoring_module_disable; + monitoring_module->super.ft_event = mca_coll_monitoring_ft_event; + + /* Initialise module collectives functions */ + /* Blocking functions */ + monitoring_module->super.coll_allgather = mca_coll_monitoring_allgather; + monitoring_module->super.coll_allgatherv = mca_coll_monitoring_allgatherv; + monitoring_module->super.coll_allreduce = mca_coll_monitoring_allreduce; + monitoring_module->super.coll_alltoall = mca_coll_monitoring_alltoall; + monitoring_module->super.coll_alltoallv = mca_coll_monitoring_alltoallv; + monitoring_module->super.coll_alltoallw = mca_coll_monitoring_alltoallw; + monitoring_module->super.coll_barrier = mca_coll_monitoring_barrier; + monitoring_module->super.coll_bcast = mca_coll_monitoring_bcast; + monitoring_module->super.coll_exscan = mca_coll_monitoring_exscan; + monitoring_module->super.coll_gather = mca_coll_monitoring_gather; + monitoring_module->super.coll_gatherv = mca_coll_monitoring_gatherv; + monitoring_module->super.coll_reduce = mca_coll_monitoring_reduce; + monitoring_module->super.coll_reduce_scatter = mca_coll_monitoring_reduce_scatter; + monitoring_module->super.coll_reduce_scatter_block = mca_coll_monitoring_reduce_scatter_block; + monitoring_module->super.coll_scan = mca_coll_monitoring_scan; + monitoring_module->super.coll_scatter = mca_coll_monitoring_scatter; + monitoring_module->super.coll_scatterv = mca_coll_monitoring_scatterv; + + /* 
Nonblocking functions */ + monitoring_module->super.coll_iallgather = mca_coll_monitoring_iallgather; + monitoring_module->super.coll_iallgatherv = mca_coll_monitoring_iallgatherv; + monitoring_module->super.coll_iallreduce = mca_coll_monitoring_iallreduce; + monitoring_module->super.coll_ialltoall = mca_coll_monitoring_ialltoall; + monitoring_module->super.coll_ialltoallv = mca_coll_monitoring_ialltoallv; + monitoring_module->super.coll_ialltoallw = mca_coll_monitoring_ialltoallw; + monitoring_module->super.coll_ibarrier = mca_coll_monitoring_ibarrier; + monitoring_module->super.coll_ibcast = mca_coll_monitoring_ibcast; + monitoring_module->super.coll_iexscan = mca_coll_monitoring_iexscan; + monitoring_module->super.coll_igather = mca_coll_monitoring_igather; + monitoring_module->super.coll_igatherv = mca_coll_monitoring_igatherv; + monitoring_module->super.coll_ireduce = mca_coll_monitoring_ireduce; + monitoring_module->super.coll_ireduce_scatter = mca_coll_monitoring_ireduce_scatter; + monitoring_module->super.coll_ireduce_scatter_block = mca_coll_monitoring_ireduce_scatter_block; + monitoring_module->super.coll_iscan = mca_coll_monitoring_iscan; + monitoring_module->super.coll_iscatter = mca_coll_monitoring_iscatter; + monitoring_module->super.coll_iscatterv = mca_coll_monitoring_iscatterv; + + /* Neighborhood functions */ + monitoring_module->super.coll_neighbor_allgather = mca_coll_monitoring_neighbor_allgather; + monitoring_module->super.coll_neighbor_allgatherv = mca_coll_monitoring_neighbor_allgatherv; + monitoring_module->super.coll_neighbor_alltoall = mca_coll_monitoring_neighbor_alltoall; + monitoring_module->super.coll_neighbor_alltoallv = mca_coll_monitoring_neighbor_alltoallv; + monitoring_module->super.coll_neighbor_alltoallw = mca_coll_monitoring_neighbor_alltoallw; + monitoring_module->super.coll_ineighbor_allgather = mca_coll_monitoring_ineighbor_allgather; + monitoring_module->super.coll_ineighbor_allgatherv = mca_coll_monitoring_ineighbor_allgatherv; + monitoring_module->super.coll_ineighbor_alltoall = mca_coll_monitoring_ineighbor_alltoall; + monitoring_module->super.coll_ineighbor_alltoallv = mca_coll_monitoring_ineighbor_alltoallv; + monitoring_module->super.coll_ineighbor_alltoallw = mca_coll_monitoring_ineighbor_alltoallw; + + /* Initialization flag */ + monitoring_module->is_initialized = 0; + + *priority = mca_coll_monitoring_component.priority; + + return &(monitoring_module->super); +} + +mca_coll_monitoring_component_t mca_coll_monitoring_component = { + .super = { + /* First, the mca_base_component_t struct containing meta + information about the component itself */ + .collm_version = { + MCA_COLL_BASE_VERSION_2_0_0, + + .mca_component_name = "monitoring", /* MCA component name */ + MCA_MONITORING_MAKE_VERSION, + .mca_open_component = mca_coll_monitoring_component_open, /* component open */ + .mca_close_component = mca_coll_monitoring_component_close, /* component close */ + .mca_register_component_params = mca_coll_monitoring_component_register + }, + .collm_data = { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + .collm_init_query = mca_coll_monitoring_component_init, + .collm_comm_query = mca_coll_monitoring_component_query + }, + .priority = INT_MAX +}; + +OBJ_CLASS_INSTANCE(mca_coll_monitoring_module_t, + mca_coll_base_module_t, + NULL, + NULL); + diff --git a/ompi/mca/coll/monitoring/coll_monitoring_exscan.c b/ompi/mca/coll/monitoring/coll_monitoring_exscan.c new file mode 100644 index 00000000000..8621506b66d 
--- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_exscan.c @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_exscan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - my_rank), monitoring_module->data); + for( i = my_rank + 1; i < comm_size; ++i ) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_exscan(sbuf, rbuf, count, dtype, op, comm, monitoring_module->real.coll_exscan_module); +} + +int mca_coll_monitoring_iexscan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - my_rank), monitoring_module->data); + for( i = my_rank + 1; i < comm_size; ++i ) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_iexscan(sbuf, rbuf, count, dtype, op, comm, request, monitoring_module->real.coll_iexscan_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_gather.c b/ompi/mca/coll/monitoring/coll_monitoring_gather.c new file mode 100644 index 00000000000..bd377773f52 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_gather.c @@ -0,0 +1,71 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_gather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, struct ompi_datatype_t *rdtype, + int root, struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(rdtype, &type_size); + data_size = rcount * type_size; + for( i = 0; i < comm_size; ++i ) { + if( root == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_a2o(data_size * (comm_size - 1), monitoring_module->data); + } + return monitoring_module->real.coll_gather(sbuf, scount, sdtype, rbuf, rcount, rdtype, root, comm, monitoring_module->real.coll_gather_module); +} + +int mca_coll_monitoring_igather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, struct ompi_datatype_t *rdtype, + int root, struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(rdtype, &type_size); + data_size = rcount * type_size; + for( i = 0; i < comm_size; ++i ) { + if( root == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_a2o(data_size * (comm_size - 1), monitoring_module->data); + } + return monitoring_module->real.coll_igather(sbuf, scount, sdtype, rbuf, rcount, rdtype, root, comm, request, monitoring_module->real.coll_igather_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_gatherv.c b/ompi/mca/coll/monitoring/coll_monitoring_gatherv.c new file mode 100644 index 00000000000..cd5c876d5dc --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_gatherv.c @@ -0,0 +1,77 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_gatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(rdtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + if( root == i ) continue; /* No communication for self */ + data_size = rcounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_a2o(data_size_aggreg, monitoring_module->data); + } + return monitoring_module->real.coll_gatherv(sbuf, scount, sdtype, rbuf, rcounts, disps, rdtype, root, comm, monitoring_module->real.coll_gatherv_module); +} + +int mca_coll_monitoring_igatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(rdtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + if( root == i ) continue; /* No communication for self */ + data_size = rcounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_a2o(data_size_aggreg, monitoring_module->data); + } + return monitoring_module->real.coll_igatherv(sbuf, scount, sdtype, rbuf, rcounts, disps, rdtype, root, comm, request, monitoring_module->real.coll_igatherv_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgather.c b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgather.c new file mode 100644 index 00000000000..e7da655ff2e --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgather.c @@ -0,0 +1,120 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_neighbor_allgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, void *rbuf, + int rcount, struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + + for( dim = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + + if (MPI_PROC_NULL != drank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_neighbor_allgather(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, monitoring_module->real.coll_neighbor_allgather_module); +} + +int mca_coll_monitoring_ineighbor_allgather(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, void *rbuf, + int rcount, struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + + for( dim = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + + if (MPI_PROC_NULL != drank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, 
comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_ineighbor_allgather(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, request, monitoring_module->real.coll_ineighbor_allgather_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgatherv.c b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgatherv.c new file mode 100644 index 00000000000..e7def27d584 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_allgatherv.c @@ -0,0 +1,124 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * Copyright (c) 2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_neighbor_allgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void * rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_2_2_0_t *cart = comm->c_topo->mtc.cart; + int dim, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + + for( dim = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + + if (MPI_PROC_NULL != drank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_neighbor_allgatherv(sbuf, scount, sdtype, rbuf, rcounts, disps, rdtype, comm, monitoring_module->real.coll_neighbor_allgatherv_module); +} + +int mca_coll_monitoring_ineighbor_allgatherv(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void * rbuf, const int *rcounts, const int *disps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_2_2_0_t *cart = comm->c_topo->mtc.cart; + int dim, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + data_size = 
scount * type_size; + + for( dim = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + + if (MPI_PROC_NULL != drank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_ineighbor_allgatherv(sbuf, scount, sdtype, rbuf, rcounts, disps, rdtype, comm, request, monitoring_module->real.coll_ineighbor_allgatherv_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoall.c b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoall.c new file mode 100644 index 00000000000..72d189b4876 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoall.c @@ -0,0 +1,122 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_neighbor_alltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void* rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + + for( dim = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + + if (MPI_PROC_NULL != drank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + 
return monitoring_module->real.coll_neighbor_alltoall(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, monitoring_module->real.coll_neighbor_alltoall_module); +} + +int mca_coll_monitoring_ineighbor_alltoall(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + + for( dim = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + + if (MPI_PROC_NULL != drank) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_ineighbor_alltoall(sbuf, scount, sdtype, rbuf, rcount, rdtype, comm, request, monitoring_module->real.coll_ineighbor_alltoall_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallv.c b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallv.c new file mode 100644 index 00000000000..028f284785a --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallv.c @@ -0,0 +1,130 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_neighbor_alltoallv(const void *sbuf, const int *scounts, + const int *sdisps, struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, i, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + + for( dim = 0, i = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + + if (MPI_PROC_NULL != drank) { + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_neighbor_alltoallv(sbuf, scounts, sdisps, sdtype, rbuf, rcounts, rdisps, rdtype, comm, monitoring_module->real.coll_neighbor_alltoallv_module); +} + +int mca_coll_monitoring_ineighbor_alltoallv(const void *sbuf, const int *scounts, + const int *sdisps, + struct ompi_datatype_t *sdtype, + void *rbuf, const int *rcounts, + const int *rdisps, + struct ompi_datatype_t *rdtype, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, i, srank, drank, world_rank; + + ompi_datatype_type_size(sdtype, &type_size); + + for( dim = 0, i = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + + if (MPI_PROC_NULL != 
drank) { + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_ineighbor_alltoallv(sbuf, scounts, sdisps, sdtype, rbuf, rcounts, rdisps, rdtype, comm, request, monitoring_module->real.coll_ineighbor_alltoallv_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallw.c b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallw.c new file mode 100644 index 00000000000..e17edba783f --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_neighbor_alltoallw.c @@ -0,0 +1,132 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_neighbor_alltoallw(const void *sbuf, const int *scounts, + const MPI_Aint *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const MPI_Aint *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, i, srank, drank, world_rank; + + for( dim = 0, i = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + ompi_datatype_type_size(sdtypes[i], &type_size); + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + + if (MPI_PROC_NULL != drank) { + ompi_datatype_type_size(sdtypes[i], &type_size); + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_neighbor_alltoallw(sbuf, scounts, sdisps, sdtypes, rbuf, rcounts, rdisps, rdtypes, comm, monitoring_module->real.coll_neighbor_alltoallw_module); +} + +int mca_coll_monitoring_ineighbor_alltoallw(const void *sbuf, const int *scounts, + const MPI_Aint *sdisps, + struct ompi_datatype_t * const *sdtypes, + void *rbuf, const int *rcounts, + const MPI_Aint *rdisps, + struct ompi_datatype_t * const *rdtypes, + struct 
ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const mca_topo_base_comm_cart_t *cart = comm->c_topo->mtc.cart; + int dim, i, srank, drank, world_rank; + + for( dim = 0, i = 0; dim < cart->ndims; ++dim ) { + srank = MPI_PROC_NULL, drank = MPI_PROC_NULL; + + if (cart->dims[dim] > 1) { + mca_topo_base_cart_shift (comm, dim, 1, &srank, &drank); + } else if (1 == cart->dims[dim] && cart->periods[dim]) { + /* Don't record exchanges with self */ + continue; + } + + if (MPI_PROC_NULL != srank) { + ompi_datatype_type_size(sdtypes[i], &type_size); + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(srank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + + if (MPI_PROC_NULL != drank) { + ompi_datatype_type_size(sdtypes[i], &type_size); + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(drank, comm, &world_rank) ) { + mca_common_monitoring_record_coll(world_rank, data_size); + data_size_aggreg += data_size; + } + ++i; + } + } + + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + + return monitoring_module->real.coll_ineighbor_alltoallw(sbuf, scounts, sdisps, sdtypes, rbuf, rcounts, rdisps, rdtypes, comm, request, monitoring_module->real.coll_ineighbor_alltoallw_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_reduce.c b/ompi/mca/coll/monitoring/coll_monitoring_reduce.c new file mode 100644 index 00000000000..35a73ee6ac8 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_reduce.c @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_reduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + for( i = 0; i < comm_size; ++i ) { + if( root == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_a2o(data_size * (comm_size - 1), monitoring_module->data); + } + return monitoring_module->real.coll_reduce(sbuf, rbuf, count, dtype, op, root, comm, monitoring_module->real.coll_reduce_module); +} + +int mca_coll_monitoring_ireduce(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + if( root == ompi_comm_rank(comm) ) { + int i, rank; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + for( i = 0; i < comm_size; ++i ) { + if( root == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_a2o(data_size * (comm_size - 1), monitoring_module->data); + } + return monitoring_module->real.coll_ireduce(sbuf, rbuf, count, dtype, op, root, comm, request, monitoring_module->real.coll_ireduce_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter.c b/ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter.c new file mode 100644 index 00000000000..e921258af16 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter.c @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_reduce_scatter(const void *sbuf, void *rbuf, + const int *rcounts, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + data_size = rcounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + data_size_aggreg += data_size; + } + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + return monitoring_module->real.coll_reduce_scatter(sbuf, rbuf, rcounts, dtype, op, comm, monitoring_module->real.coll_reduce_scatter_module); +} + +int mca_coll_monitoring_ireduce_scatter(const void *sbuf, void *rbuf, + const int *rcounts, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + data_size = rcounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + data_size_aggreg += data_size; + } + mca_common_monitoring_coll_a2a(data_size_aggreg, monitoring_module->data); + return monitoring_module->real.coll_ireduce_scatter(sbuf, rbuf, rcounts, dtype, op, comm, request, monitoring_module->real.coll_ireduce_scatter_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter_block.c b/ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter_block.c new file mode 100644 index 00000000000..a869fc2a594 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_reduce_scatter_block.c @@ -0,0 +1,72 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_reduce_scatter_block(const void *sbuf, void *rbuf, + int rcount, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = rcount * type_size; + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + return monitoring_module->real.coll_reduce_scatter_block(sbuf, rbuf, rcount, dtype, op, comm, monitoring_module->real.coll_reduce_scatter_block_module); +} + +int mca_coll_monitoring_ireduce_scatter_block(const void *sbuf, void *rbuf, + int rcount, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = rcount * type_size; + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_a2a(data_size * (comm_size - 1), monitoring_module->data); + return monitoring_module->real.coll_ireduce_scatter_block(sbuf, rbuf, rcount, dtype, op, comm, request, monitoring_module->real.coll_ireduce_scatter_block_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_scan.c b/ompi/mca/coll/monitoring/coll_monitoring_scan.c new file mode 100644 index 00000000000..ff307a7acfb --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_scan.c @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_scan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - my_rank), monitoring_module->data); + for( i = my_rank + 1; i < comm_size; ++i ) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_scan(sbuf, rbuf, count, dtype, op, comm, monitoring_module->real.coll_scan_module); +} + +int mca_coll_monitoring_iscan(const void *sbuf, void *rbuf, int count, + struct ompi_datatype_t *dtype, + struct ompi_op_t *op, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + const int my_rank = ompi_comm_rank(comm); + int i, rank; + ompi_datatype_type_size(dtype, &type_size); + data_size = count * type_size; + mca_common_monitoring_coll_a2a(data_size * (comm_size - my_rank), monitoring_module->data); + for( i = my_rank + 1; i < comm_size; ++i ) { + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + return monitoring_module->real.coll_iscan(sbuf, rbuf, count, dtype, op, comm, request, monitoring_module->real.coll_iscan_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_scatter.c b/ompi/mca/coll/monitoring/coll_monitoring_scatter.c new file mode 100644 index 00000000000..3aab77d7f87 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_scatter.c @@ -0,0 +1,78 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_scatter(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + const int my_rank = ompi_comm_rank(comm); + if( root == my_rank ) { + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_o2a(data_size * (comm_size - 1), monitoring_module->data); + } + return monitoring_module->real.coll_scatter(sbuf, scount, sdtype, rbuf, rcount, rdtype, root, comm, monitoring_module->real.coll_scatter_module); +} + + +int mca_coll_monitoring_iscatter(const void *sbuf, int scount, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, + struct ompi_datatype_t *rdtype, + int root, + struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + const int my_rank = ompi_comm_rank(comm); + if( root == my_rank ) { + size_t type_size, data_size; + const int comm_size = ompi_comm_size(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + data_size = scount * type_size; + for( i = 0; i < comm_size; ++i ) { + if( my_rank == i ) continue; /* No communication for self */ + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + } + } + mca_common_monitoring_coll_o2a(data_size * (comm_size - 1), monitoring_module->data); + } + return monitoring_module->real.coll_iscatter(sbuf, scount, sdtype, rbuf, rcount, rdtype, root, comm, request, monitoring_module->real.coll_iscatter_module); +} diff --git a/ompi/mca/coll/monitoring/coll_monitoring_scatterv.c b/ompi/mca/coll/monitoring/coll_monitoring_scatterv.c new file mode 100644 index 00000000000..f187741cab2 --- /dev/null +++ b/ompi/mca/coll/monitoring/coll_monitoring_scatterv.c @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include +#include +#include "coll_monitoring.h" + +int mca_coll_monitoring_scatterv(const void *sbuf, const int *scounts, const int *disps, + struct ompi_datatype_t *sdtype, + void* rbuf, int rcount, struct ompi_datatype_t *rdtype, + int root, struct ompi_communicator_t *comm, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + const int my_rank = ompi_comm_rank(comm); + if( root == my_rank ) { + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_o2a(data_size_aggreg, monitoring_module->data); + } + return monitoring_module->real.coll_scatterv(sbuf, scounts, disps, sdtype, rbuf, rcount, rdtype, root, comm, monitoring_module->real.coll_scatterv_module); +} + +int mca_coll_monitoring_iscatterv(const void *sbuf, const int *scounts, const int *disps, + struct ompi_datatype_t *sdtype, + void *rbuf, int rcount, struct ompi_datatype_t *rdtype, + int root, struct ompi_communicator_t *comm, + ompi_request_t ** request, + mca_coll_base_module_t *module) +{ + mca_coll_monitoring_module_t*monitoring_module = (mca_coll_monitoring_module_t*) module; + const int my_rank = ompi_comm_rank(comm); + if( root == my_rank ) { + size_t type_size, data_size, data_size_aggreg = 0; + const int comm_size = ompi_comm_size(comm); + int i, rank; + ompi_datatype_type_size(sdtype, &type_size); + for( i = 0; i < comm_size; ++i ) { + data_size = scounts[i] * type_size; + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + if( OPAL_SUCCESS == mca_common_monitoring_get_world_rank(i, comm, &rank) ) { + mca_common_monitoring_record_coll(rank, data_size); + data_size_aggreg += data_size; + } + } + mca_common_monitoring_coll_o2a(data_size_aggreg, monitoring_module->data); + } + return monitoring_module->real.coll_iscatterv(sbuf, scounts, disps, sdtype, rbuf, rcount, rdtype, root, comm, request, monitoring_module->real.coll_iscatterv_module); +} diff --git a/ompi/mca/common/monitoring/HowTo_pml_monitoring.tex b/ompi/mca/common/monitoring/HowTo_pml_monitoring.tex new file mode 100644 index 00000000000..752ed464520 --- /dev/null +++ b/ompi/mca/common/monitoring/HowTo_pml_monitoring.tex @@ -0,0 +1,1298 @@ +% Copyright (c) 2016-2017 Inria. All rights reserved. +% $COPYRIGHT$ +% +% Additional copyrights may follow +% +% $HEADER$ + +\documentclass[notitlepage]{article} + +\usepackage[english]{babel} +\usepackage[utf8]{inputenc} +\usepackage[T1]{fontenc} +\usepackage[a4paper]{geometry} +\usepackage{verbatim} +\usepackage{dirtree} + +\title{How to use Open~MPI monitoring component} + +\author{C. 
FOYER - INRIA}
+
+\newcommand{\mpit}[1]{\textit{MPI\_Tool#1}}
+\newcommand{\ompi}[0]{Open~MPI}
+\newcommand{\brkunds}[0]{\allowbreak\_}
+
+\begin{document}
+
+\maketitle
+
+\section{Introduction}
+
+\mpit{} is a concept introduced in the MPI-3 standard. It allows MPI
+developers, or third parties, to offer a portable interface to
+different tools. These tools may be used to monitor applications,
+measure their performance, or profile them. \mpit{} is an interface
+that eases the addition of external functions to an MPI library. It
+also allows the user to control and monitor given internal variables
+of the runtime system.
+
+The present document introduces the use of the \mpit{} interface from
+a user point of view, and facilitates the usage of the \ompi{}
+monitoring component. This component allows for precisely recording
+the message exchanges between nodes during the execution of MPI
+applications. The number of messages and the amount of data exchanged
+are recorded, including or excluding internal communications (such as
+those generated by the implementation of the collective algorithms).
+
+This component offers two types of monitoring, depending on whether
+the user wants fine control over the monitoring or just an overall
+view of the messages. Moreover, the fine control allows the user to
+access the results from within the application, and lets them reset
+the variables when needed. The fine control is achieved via the
+\mpit{} interface, which needs the code to be adapted by adding a
+specific initialization function. However, the basic overall
+monitoring is achieved without any modification of the application
+code.
+
+Whether you are using one version or the other, the monitoring needs
+to be enabled with parameters added when calling \texttt{mpiexec}, or
+globally in your \ompi{} MCA configuration file
+(\${HOME}/openmpi/mca-param.conf). Three new parameters have been
+introduced:
+\begin{description}
+\item [\texttt{-{}-mca pml\brkunds{}monitoring\brkunds{}enable value}]
+  This parameter sets the monitoring mode. \texttt{value} may be:
+  \begin{description}
+  \item [0] monitoring is disabled
+  \item [1] monitoring is enabled, with no distinction between user
+    issued and library issued messages.
+  \item [$\ge$ 2] monitoring is enabled, with a distinction between
+    messages issued from the library ({\bf internal}) and messages
+    issued from the user ({\bf external}).
+  \end{description}
+\item [\texttt{-{}-mca
+    pml\brkunds{}monitoring\brkunds{}enable\brkunds{}output value}]
+  This parameter enables the automatic flushing of monitored values
+  during the call to \texttt{MPI\brkunds{}Finalize}. {\bf This option
+    is to be used only without \mpit{}, or with \texttt{value} =
+    0}. \texttt{value} may be:
+  \begin{description}
+  \item [0] final output flushing is disabled
+  \item [1] final output flushing is done in the standard output
+    stream (\texttt{stdout})
+  \item [2] final output flushing is done in the error output stream
+    (\texttt{stderr})
+  \item [$\ge$ 3] final output flushing is done in the file whose name
+    is given with the
+    \texttt{pml\brkunds{}monitoring\brkunds{}filename} parameter.
+  \end{description}
+  Each MPI process flushes its recorded data. The pieces of
+  information can be aggregated either with the use of PMPI (see
+  Section~\ref{subsec:ldpreload}) or with the distributed script {\it
+    test/monitoring/profile2mat.pl}.
+\item [\texttt{-{}-mca pml\brkunds{}monitoring\brkunds{}filename
+    filename}] Sets the file to which the resulting monitoring output
+  is flushed. The output is a communication matrix of both the number
+  of messages and the total size of exchanged data between each pair
+  of nodes. This parameter is needed if
+  \texttt{pml\brkunds{}monitoring\brkunds{}enable\brkunds{}output}
+  $\ge$ 3.
+\end{description}
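+
+For instance, assuming an application binary named \texttt{my\_app}
+and an existing \texttt{prof} directory (both names are only
+illustrative), the three parameters can be combined as follows to dump
+a communication matrix at \texttt{MPI\brkunds{}Finalize}:
+\begin{verbatim}
+mpiexec -n 4 --mca pml_monitoring_enable 2 \
+        --mca pml_monitoring_enable_output 3 \
+        --mca pml_monitoring_filename prof/monitoring \
+        ./my_app
+\end{verbatim}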
+
+Also, in order to run an application with some (or all) of the
+monitoring components disabled, you can add the following parameters
+when calling \texttt{mpiexec}:
+\begin{description}
+\item [\texttt{-{}-mca pml \^{}monitoring}] This parameter disables
+  the monitoring component of the PML framework.
+\item [\texttt{-{}-mca osc \^{}monitoring}] This parameter disables
+  the monitoring component of the OSC framework.
+\item [\texttt{-{}-mca coll \^{}monitoring}] This parameter disables
+  the monitoring component of the COLL framework.
+\end{description}
+
+\section{Without \mpit{}}
+
+This mode should be used to monitor the whole application from its
+start until its end. It is designed so that you can record the amount
+of communication without any code modification.
+
+In order to do so, you have to get \ompi{} compiled with monitoring
+enabled. When you launch your application, you need to set the
+parameter \texttt{pml\brkunds{}monitoring\brkunds{}enable} to a value
+$> 0$, and, if
+\texttt{pml\brkunds{}monitoring\brkunds{}enable\brkunds{}output} $\ge$
+3, to set the \texttt{pml\brkunds{}monitoring\brkunds{}filename}
+parameter to a proper filename, whose path must exist.
+
+\section{With \mpit{}}
+
+This section explains how to monitor your applications with the use
+of \mpit{}.
+
+\subsection{How it works}
+
+\mpit{} is a layer that is added to the standard MPI
+implementation. As such, it must be noted first that it may have an
+impact on performance.
+
+As this functionality is orthogonal to the core one, \mpit{}
+initialization and finalization are independent from MPI's. There is
+no restriction regarding the order of the different calls. Also, the
+\mpit{} interface initialization function can be called more than
+once within the execution, as long as the finalize function is called
+as many times.
+
+\mpit{} introduces two types of variables, \textit{control variables}
+and \textit{performance variables}. These variables will be referred
+to respectively as \textit{cvar} and \textit{pvar}. The variables can
+be used to dynamically tune the application to best fit its
+needs. They are defined by the library (or by the external
+component), and accessed with the accessor functions specified in the
+standard. The variables are named uniquely throughout the
+application. Every variable, once defined and registered within the
+MPI engine, is given an index that will not change during the entire
+execution.
+
+As for the monitoring without \mpit{}, you need to start your
+application with the control variable
+\textit{pml\brkunds{}monitoring\brkunds{}enable} properly set. Even
+though it is not required, you can also add to your command line the
+desired filename to flush the monitoring output. As long as no
+filename is provided, no output can be generated.
+
+\subsection{Initialization}
+
+The initialization is made by a call to
+\texttt{MPI\brkunds{}T\brkunds{}init\brkunds{}thread}. This function
+takes two parameters. The first one is the desired level of thread
+support, the second one is the provided level of thread support. It
+has the same semantics as the
+\texttt{MPI\brkunds{}Init\brkunds{}thread} function. Please note that
+the first function to be called (between
+\texttt{MPI\brkunds{}T\brkunds{}init\brkunds{}thread} and
+\texttt{MPI\brkunds{}Init\brkunds{}thread}) may influence the second
+one for the provided level of thread support. This function's goal is
+to initialize control and performance variables.
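+
+The following minimal sketch (error checking reduced to aborting, and
+assuming \texttt{MPI\brkunds{}Init} has already been called, as in the
+complete example later in this document) illustrates this
+initialization step together with the session creation described in
+the next paragraphs:
+\begin{verbatim}
+#include <stdio.h>
+#include <mpi.h>
+
+MPI_T_pvar_session session;
+int provided;
+
+/* Initialize the MPI_T interface; independent from MPI_Init */
+if (MPI_SUCCESS != MPI_T_init_thread(MPI_THREAD_SINGLE, &provided)) {
+    fprintf(stderr, "MPI_T initialization failed\n");
+    MPI_Abort(MPI_COMM_WORLD, -1);
+}
+
+/* Create a session to isolate our performance variable handles */
+if (MPI_SUCCESS != MPI_T_pvar_session_create(&session)) {
+    fprintf(stderr, "MPI_T session creation failed\n");
+    MPI_Abort(MPI_COMM_WORLD, -1);
+}
+\end{verbatim}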
+
+But, in order to use the performance variables within one context
+without influencing the ones from another context, a variable has to
+be bound to a session. To create a session, you have to call
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}session\brkunds{}create}
+in order to initialize a session.
+
+In addition to the binding of a session, a performance variable may
+also depend on an MPI object. For example, the
+\textit{pml\brkunds{}monitoring\brkunds{}flush} variable needs to be
+bound to a communicator. In order to do so, you need to use the
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}handle\brkunds{}alloc}
+function, which takes as parameters the used session, the id of the
+variable, the MPI object
+(i.e. \texttt{MPI\brkunds{}COMM\brkunds{}WORLD} in the case of
+\textit{pml\brkunds{}monitoring\brkunds{}flush}), the reference to the
+performance variable handle and a reference to an integer value. The
+last parameter allows the user to receive some additional information
+about the variable or the bound MPI object. As an example, when
+binding to the \textit{pml\brkunds{}monitoring\brkunds{}flush}
+performance variable, the last parameter is set to the length of the
+current filename used for the flush, if any, and 0 otherwise; when
+binding to the
+\textit{pml\brkunds{}monitoring\brkunds{}messages\brkunds{}count}
+performance variable, the parameter is set to the size of the bound
+communicator, as it corresponds to the expected size of the array (in
+number of elements) when retrieving the data. This parameter is used
+to let the application determine the amount of data to be returned
+when reading the performance variables. Please note that the
+\textit{handle\brkunds{}alloc} function takes the variable id as a
+parameter. In order to retrieve this value, you have to call
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}get\brkunds{}index},
+which takes as an IN parameter a string that contains the name of the
+desired variable.
+
+\subsection{How to use the performance variables}
+
+Some performance variables are defined in the monitoring component:
+\begin{description}
+\item [\textit{pml\brkunds{}monitoring\brkunds{}flush}] Allows the
+  user to define a file where to flush the recorded data.
+\item
+  [\textit{pml\brkunds{}monitoring\brkunds{}messages\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of messages exchanged through the PML framework with each node of
+  the bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{pml\brkunds{}monitoring\brkunds{}messages\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data exchanged through the PML framework with each node of the
+  bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}sent\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of messages sent through the OSC framework with each node of the
+  bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}sent\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data sent through the OSC framework with each node of the bound
+  communicator (\textit{MPI\brkunds{}Comm}). This variable returns an
+  array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}recv\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of messages received through the OSC framework with each node of
+  the bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}recv\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data received through the OSC framework with each node of the
+  bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{coll\brkunds{}monitoring\brkunds{}messages\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of messages exchanged through the COLL framework with each node of
+  the bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item
+  [\textit{coll\brkunds{}monitoring\brkunds{}messages\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data exchanged through the COLL framework with each node of the
+  bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns an array of {\tt size\_t} integers, with one entry per node.
+\item [\textit{coll\brkunds{}monitoring\brkunds{}o2a\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of one-to-all collective operations across the bound communicator
+  (\textit{MPI\brkunds{}Comm}) where the process was defined as
+  root. This variable returns a single {\tt size\_t} integer.
+\item [\textit{coll\brkunds{}monitoring\brkunds{}o2a\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data sent as one-to-all collective operations across the bound
+  communicator (\textit{MPI\brkunds{}Comm}). This variable returns a
+  single {\tt size\_t} integer. The communications between a process
+  and itself are not taken into account.
+\item [\textit{coll\brkunds{}monitoring\brkunds{}a2o\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of all-to-one collective operations across the bound communicator
+  (\textit{MPI\brkunds{}Comm}) where the process was defined as
+  root. This variable returns a single {\tt size\_t} integer.
+\item [\textit{coll\brkunds{}monitoring\brkunds{}a2o\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data received from all-to-one collective operations across the
+  bound communicator (\textit{MPI\brkunds{}Comm}). This variable
+  returns a single {\tt size\_t} integer. The communications between a
+  process and itself are not taken into account.
+\item [\textit{coll\brkunds{}monitoring\brkunds{}a2a\brkunds{}count}]
+  Allows the user to access, from within the application, the number
+  of all-to-all collective operations across the bound communicator
+  (\textit{MPI\brkunds{}Comm}). This variable returns a single {\tt
+    size\_t} integer.
+\item [\textit{coll\brkunds{}monitoring\brkunds{}a2a\brkunds{}size}]
+  Allows the user to access, from within the application, the amount
+  of data sent as all-to-all collective operations across the bound
+  communicator (\textit{MPI\brkunds{}Comm}). This variable returns a
+  single {\tt size\_t} integer. The communications between a process
+  and itself are not taken into account.
+\end{description}
+
+In case of uncertainty about how a collective operation is
+categorized, please refer to the list given in
+Table~\ref{tab:coll-cat}.
+
+\begin{table}
+  \begin{center}
+    \begin{tabular}{|l|l|l|}
+      \hline
+      One-To-All & All-To-One & All-To-All \\
+      \hline
+      MPI\_Bcast & MPI\_Gather & MPI\_Allgather \\
+      MPI\_Ibcast & MPI\_Gatherv & MPI\_Allgatherv \\
+      MPI\_Iscatter & MPI\_Igather & MPI\_Allreduce \\
+      MPI\_Iscatterv & MPI\_Igatherv & MPI\_Alltoall \\
+      MPI\_Scatter & MPI\_Ireduce & MPI\_Alltoallv \\
+      MPI\_Scatterv & MPI\_Reduce & MPI\_Alltoallw \\
+      && MPI\_Barrier \\
+      && MPI\_Exscan \\
+      && MPI\_Iallgather \\
+      && MPI\_Iallgatherv \\
+      && MPI\_Iallreduce \\
+      && MPI\_Ialltoall \\
+      && MPI\_Ialltoallv \\
+      && MPI\_Ialltoallw \\
+      && MPI\_Ibarrier \\
+      && MPI\_Iexscan \\
+      && MPI\_Ineighbor\_allgather \\
+      && MPI\_Ineighbor\_allgatherv \\
+      && MPI\_Ineighbor\_alltoall \\
+      && MPI\_Ineighbor\_alltoallv \\
+      && MPI\_Ineighbor\_alltoallw \\
+      && MPI\_Ireduce\_scatter \\
+      && MPI\_Ireduce\_scatter\_block \\
+      && MPI\_Iscan \\
+      && MPI\_Neighbor\_allgather \\
+      && MPI\_Neighbor\_allgatherv \\
+      && MPI\_Neighbor\_alltoall \\
+      && MPI\_Neighbor\_alltoallv \\
+      && MPI\_Neighbor\_alltoallw \\
+      && MPI\_Reduce\_scatter \\
+      && MPI\_Reduce\_scatter\_block \\
+      && MPI\_Scan \\
+      \hline
+    \end{tabular}
+  \end{center}
+  \caption{Collective Operations Categorization}
+  \label{tab:coll-cat}
+\end{table}
+
+Once bound to a session and to the proper MPI object, these variables
+may be accessed through a set of given functions. It must be noted
+here that each of the functions applied to the different variables
+needs, in fact, to be called with the handle of the variable.
+
+The first variable may be modified by using the
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}write} function. The
+latter variables may be read using
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}read} but cannot be
+written. Stopping the \textit{flush} performance variable, with a call
+to \texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}stop}, forces the
+counters to be flushed into the given file, resetting the counters to
+0 at the same time. Also, binding a new handle to the \textit{flush}
+variable will reset the counters. Finally, please note that the size
+and count performance variables may overflow after many large
+communications.
+
+The monitoring starts with the call to
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}start} and lasts until
+the moment you call the
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}stop} function.
+
+Once you are done with the monitoring, you can clean everything by
+calling the function
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}handle\brkunds{}free} to
+free the allocated handles,
+\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}session\brkunds{}free}
+to free the session, and \texttt{MPI\brkunds{}T\brkunds{}finalize} to
+state the end of your use of performance and control variables.
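+
+As an illustration, the following sketch (error checking omitted for
+brevity) reads the single-valued
+\textit{coll\brkunds{}monitoring\brkunds{}o2a\brkunds{}count}
+variable. It assumes a session has already been created as described
+above, and that the variable is exposed with the {\tt
+  MPI\brkunds{}T\brkunds{}PVAR\brkunds{}CLASS\brkunds{}SIZE} class,
+like the other monitoring counters:
+\begin{verbatim}
+MPI_T_pvar_handle o2a_handle;
+int o2a_idx, count;
+size_t o2a_count;
+
+/* Retrieve the variable index from its name */
+MPI_T_pvar_get_index("coll_monitoring_o2a_count",
+                     MPI_T_PVAR_CLASS_SIZE, &o2a_idx);
+
+/* Bind the variable to MPI_COMM_WORLD; count receives the number
+   of elements to be read (expected to be 1 for this variable) */
+MPI_T_pvar_handle_alloc(session, o2a_idx, MPI_COMM_WORLD,
+                        &o2a_handle, &count);
+
+MPI_T_pvar_start(session, o2a_handle);
+/* ... collective operations to be monitored ... */
+MPI_T_pvar_read(session, o2a_handle, &o2a_count);
+MPI_T_pvar_stop(session, o2a_handle);
+MPI_T_pvar_handle_free(session, &o2a_handle);
+\end{verbatim}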
+
+\subsection{Overview of the calls}
+
+To summarize the previous information, here is the list of available
+performance variables, and the outline of the different calls to be
+used to properly access monitored data through the \mpit{} interface.
+\begin{itemize}
+\item \textit{pml\brkunds{}monitoring\brkunds{}flush}
+\item
+  \textit{pml\brkunds{}monitoring\brkunds{}messages\brkunds{}count}
+\item \textit{pml\brkunds{}monitoring\brkunds{}messages\brkunds{}size}
+\item
+  \textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}sent\brkunds{}count}
+\item
+  \textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}sent\brkunds{}size}
+\item
+  \textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}recv\brkunds{}count}
+\item
+  \textit{osc\brkunds{}monitoring\brkunds{}messages\brkunds{}recv\brkunds{}size}
+\item
+  \textit{coll\brkunds{}monitoring\brkunds{}messages\brkunds{}count}
+\item
+  \textit{coll\brkunds{}monitoring\brkunds{}messages\brkunds{}size}
+\item \textit{coll\brkunds{}monitoring\brkunds{}o2a\brkunds{}count}
+\item \textit{coll\brkunds{}monitoring\brkunds{}o2a\brkunds{}size}
+\item \textit{coll\brkunds{}monitoring\brkunds{}a2o\brkunds{}count}
+\item \textit{coll\brkunds{}monitoring\brkunds{}a2o\brkunds{}size}
+\item \textit{coll\brkunds{}monitoring\brkunds{}a2a\brkunds{}count}
+\item \textit{coll\brkunds{}monitoring\brkunds{}a2a\brkunds{}size}
+\end{itemize}
+Add to your command line at least \texttt{-{}-mca
+  pml\brkunds{}monitoring\brkunds{}enable [1,2]} \\ Sequence of
+\mpit{} calls:
+\begin{enumerate}
+\item {\texttt{MPI\brkunds{}T\brkunds{}init\brkunds{}thread}}
+  Initialize the MPI\brkunds{}Tools interface
+\item
+  {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}get\brkunds{}index}}
+  To retrieve the variable id
+\item
+  {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}session\brkunds{}create}}
+  To create a new context in which you use your variable
+\item
+  {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}handle\brkunds{}alloc}}
+  To bind your variable to the proper session and MPI object
+\item {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}start}} To start
+  the monitoring
+\item Now you do all the communications you want to monitor
+\item {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}stop}} To stop
+  and flush the monitoring
+\item
+  {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}handle\brkunds{}free}}
+\item
+  {\texttt{MPI\brkunds{}T\brkunds{}pvar\brkunds{}session\brkunds{}free}}
+\item {\texttt{MPI\brkunds{}T\brkunds{}finalize}}
+\end{enumerate}
+
+\subsection{Use of \textsc{LD\brkunds{}PRELOAD}}
+\label{subsec:ldpreload}
+
+In order to automatically generate communication matrices, you can use
+the {\it monitoring\brkunds{}prof} tool that can be found in
+\textit{test/monitoring/monitoring\brkunds{}prof.c}. While launching
+your application, you can add the following option in addition to the
+\texttt{-{}-mca pml\brkunds{}monitoring\brkunds{}enable} parameter:
+\begin{description}
+\item [\texttt{-x
+    LD\_PRELOAD=ompi\_install\_dir/lib/monitoring\_prof.so}]
+\end{description}
+
+This library automatically gathers sent and received data into one
+communication matrix. Note, however, that the use of the monitoring
+\mpit{} variables within the code may interfere with this library. The
+main goal of this library is to avoid dumping one file per MPI
+process, and to gather everything in one file aggregating all pieces
+of information.
+
+The resulting communication matrices are as close as possible to the
+effective amount of data exchanged between nodes.
But it has to be +kept in mind that because of the stack of the logical layers in +\ompi{}, the amount of data recorded as part of collectives or +one-sided operations may be duplicated when the PML layer handles the +communication. For an exact measure of communications, the application +must use \mpit{}'s monitoring performance variables to potentially +subtract double-recorded data. + +\subsection{Examples} + +First is presented an example of monitoring using the \mpit{} in order +to define phases during which the monitoring component is active. A +second snippet is presented for how to access monitoring performance +variables with \mpit{}. + +\subsubsection{Monitoring Phases} + +You can execute the following example with +\\ \verb|mpiexec -n 4 --mca pml_monitoring_enable 2 test_monitoring|. Please +note that you need the prof directory to already exists to retrieve +the dumped files. Following the complete code example, you will find a +sample dumped file and the corresponding explanations. + +\paragraph{test\_monitoring.c} (extract) + +\begin{verbatim} +#include +#include +#include + +static const void* nullbuff = NULL; +static MPI_T_pvar_handle flush_handle; +static const char flush_pvar_name[] = "pml_monitoring_flush"; +static const char flush_cvar_name[] = "pml_monitoring_enable"; +static int flush_pvar_idx; + +int main(int argc, char* argv[]) +{ + int rank, size, n, to, from, tagno, MPIT_result, provided, count; + MPI_T_pvar_session session; + MPI_Status status; + MPI_Comm newcomm; + MPI_Request request; + char filename[1024]; + + /* Initialization of parameters */ + + n = -1; + MPI_Init(&argc, &argv); + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Comm_size(MPI_COMM_WORLD, &size); + to = (rank + 1) % size; + from = (rank + size - 1) % size; + tagno = 201; + + /* Initialization of performance variables */ + + MPIT_result = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); + if (MPIT_result != MPI_SUCCESS) + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + + MPIT_result = MPI_T_pvar_get_index(flush_pvar_name, + MPI_T_PVAR_CLASS_GENERIC, + &flush_pvar_idx); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot find monitoring MPI_T \"%s\" pvar, " + "check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_session_create(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot create a session for \"%s\" pvar\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Allocating a new PVAR in a session will reset the counters */ + + MPIT_result = MPI_T_pvar_handle_alloc(session, flush_pvar_idx, + MPI_COMM_WORLD, + &flush_handle, + &count); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to allocate handle on \"%s\" pvar, " + "check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* First phase: make a token circulated in MPI_COMM_WORLD */ + + MPIT_result = MPI_T_pvar_start(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, " + "check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + if (rank == 0) { + n = 25; + MPI_Isend(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD,&request); + } + while (1) { + MPI_Irecv(&n, 1, MPI_INT, from, tagno, MPI_COMM_WORLD, &request); + MPI_Wait(&request, &status); + if (rank == 0) {n--;tagno++;} + MPI_Isend(&n, 1, MPI_INT, to, tagno, MPI_COMM_WORLD, &request); + if (rank != 0) {n--;tagno++;} + if (n<0){ + 
break;
+        }
+    }
+
+    /*
+     * Build one file per process.
+     * Everything that has been monitored by each
+     * process since the last flush will be output in filename.
+     *
+     * Requires directory prof to be created.
+     * Filename format should display the phase number
+     * and the process rank for ease of parsing with
+     * aggregate_profile.pl script
+     */
+
+    sprintf(filename,"prof/phase_1");
+    if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle,
+                                        filename) )
+    {
+        fprintf(stderr,
+                "Process %d cannot save monitoring in %s.%d.prof\n",
+                rank, filename, rank);
+    }
+
+    /* Force the writing of the monitoring data */
+
+    MPIT_result = MPI_T_pvar_stop(session, flush_handle);
+    if (MPIT_result != MPI_SUCCESS) {
+        printf("failed to stop handle on \"%s\" pvar, "
+               "check that you have monitoring pml\n",
+               flush_pvar_name);
+        MPI_Abort(MPI_COMM_WORLD, MPIT_result);
+    }
+
+    /*
+     * Don't set a filename. If we stop the session before setting
+     * it, then no output will be generated.
+     */
+
+    if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle,
+                                        &nullbuff) )
+    {
+        fprintf(stderr,
+                "Process %d cannot save monitoring in %s\n",
+                rank, filename);
+    }
+
+    (void)MPI_T_finalize();
+
+    MPI_Finalize();
+
+    return EXIT_SUCCESS;
+}
+\end{verbatim}
+
+\paragraph{prof/phase\_1.0.prof}
+
+\begin{verbatim}
+# POINT TO POINT
+E 0 1 108 bytes 27 msgs sent 0,0,0,27,0,[...],0
+# OSC
+# COLLECTIVES
+D MPI_COMM_WORLD procs: 0,1,2,3
+O2A 0 0 bytes 0 msgs sent
+A2O 0 0 bytes 0 msgs sent
+A2A 0 0 bytes 0 msgs sent
+\end{verbatim}
+
+As shown in the sample profile, for each kind of communication
+(point-to-point, one-sided and collective), you find all the related
+information. There is one line per pair of communicating peers. Each
+line starts with a letter describing the kind of communication, as
+follows:
+
+\begin{description}
+\item [{\tt E}] External messages, i.e. issued by the user
+\item [{\tt I}] Internal messages, i.e. issued by the library
+\item [{\tt S}] Sent one-sided messages, i.e. writing access to the remote memory
+\item [{\tt R}] Received one-sided messages, i.e. reading access to the remote memory
+\item [{\tt C}] Collective messages
+\end{description}
+
+This letter is followed by the rank of the issuing process, and the
+rank of the receiving one. Then you have the total amount in bytes
+exchanged and the count of messages. For point-to-point entries
+(i.e. {\tt E} or {\tt I} entries), the line is completed by the full
+distribution of messages in the form of a histogram. See variable {\tt
+  size\brkunds{}histogram} in
+Section~\ref{subsubsec:TDI-common-monitoring} for the corresponding
+values. If the filtering between external and internal messages is
+disabled, the {\tt I} lines are merged with the {\tt E} lines, keeping
+the {\tt E} header.
+
+The end of the summary is per-communicator information, where you
+find the name of the communicator, the ranks of the processes included
+in this communicator, and the amount of data sent (or received) for
+each kind of collective, with the corresponding count of operations of
+each kind.
+
+\subsubsection{Accessing Monitoring Performance Variables}
+
+The following snippet presents how to access the performance
+variables defined as part of the \mpit{} interface. The session
+allocation is not presented as it is the same as in the previous
+example.
Please note that contrary to the {\it + pml\brkunds{}monitoring\brkunds{}flush} variable, the class of the +monitoring performance values is {\tt + MPI\brkunds{}T\brkunds{}PVAR\brkunds{}CLASS\brkunds{}SIZE}, whereas +the {\it flush} variable is of class {\tt GENERIC}. Also, performances +variables are only to be read. + +\paragraph{test/monitoring/example\_reduce\_count.c} (extract) + +\begin{verbatim} +MPI_T_pvar_handle count_handle; +int count_pvar_idx; +const char count_pvar_name[] = "pml_monitoring_messages_count"; +size_t*counts; + +/* Retrieve the proper pvar index */ +MPIT_result = MPI_T_pvar_get_index(count_pvar_name, MPI_T_PVAR_CLASS_SIZE, &count_pvar_idx); +if (MPIT_result != MPI_SUCCESS) { + printf("cannot find monitoring MPI_T \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); +} + +/* Allocating a new PVAR in a session will reset the counters */ +MPIT_result = MPI_T_pvar_handle_alloc(session, count_pvar_idx, + MPI_COMM_WORLD, &count_handle, &count); +if (MPIT_result != MPI_SUCCESS) { + printf("failed to allocate handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); +} + +counts = (size_t*)malloc(count * sizeof(size_t)); + +MPIT_result = MPI_T_pvar_start(session, count_handle); +if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); +} + +/* Token Ring communications */ +if (rank == 0) { + n = 25; + MPI_Isend(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD,&request); +} +while (1) { + MPI_Irecv(&n, 1, MPI_INT, from, tagno, MPI_COMM_WORLD, &request); + MPI_Wait(&request, &status); + if (rank == 0) {n--;tagno++;} + MPI_Isend(&n, 1, MPI_INT, to, tagno, MPI_COMM_WORLD, &request); + if (rank != 0) {n--;tagno++;} + if (n<0){ + break; + } +} + +MPIT_result = MPI_T_pvar_read(session, count_handle, counts); +if (MPIT_result != MPI_SUCCESS) { + printf("failed to read handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); +} + +/* Global reduce so everyone knows the maximum messages sent to each rank */ +MPI_Allreduce(MPI_IN_PLACE, counts, count, MPI_UNSIGNED_LONG, MPI_MAX, MPI_COMM_WORLD); + +/* OPERATIONS ON COUNTS */ +... + +free(counts); + +MPIT_result = MPI_T_pvar_stop(session, count_handle); +if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); +} + +MPIT_result = MPI_T_pvar_handle_free(session, &count_handle); +if (MPIT_result != MPI_SUCCESS) { + printf("failed to free handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); +} +\end{verbatim} + +\section{Technical Documentation of the Implementation} +\label{sec:TDI} + +This section describes the technical details of the components +implementation. It is of no use from a user point of view but it is made +to facilitate the work for future developer that would debug or enrich +the monitoring components. + +The architecture of this component is as follows. The Common component +is the main part where the magic occurs. PML, OSC and COLL components +are the entry points to the monitoring tool from the software stack +point-of-view. 
The relevant files can be found in accordance with +the partial directory tree presented in Figure~\ref{fig:tree}. + +\begin{figure} + \dirtree{% + .1 ompi/mca/. + .2 common. + .3 monitoring. + .4 common\_monitoring.h. + .4 common\_monitoring.c. + .4 common\_monitoring\_coll.h. + .4 common\_monitoring\_coll.c. + .4 HowTo\_pml\_monitoring.tex. + .4 Makefile.am. + .2 pml. + .3 monitoring. + .4 pml\_monitoring.h. + .4 pml\_monitoring\_component.c. + .4 pml\_monitoring\_comm.c. + .4 pml\_monitoring\_irecv.c. + .4 pml\_monitoring\_isend.c. + .4 pml\_monitoring\_start.c. + .4 pml\_monitoring\_iprobe.c. + .4 Makefile.am. + .2 osc. + .3 monitoring. + .4 osc\_monitoring.h. + .4 osc\_monitoring\_component.c. + .4 osc\_monitoring\_comm.h. + .4 osc\_monitoring\_module.h. + .4 osc\_monitoring\_dynamic.h. + .4 osc\_monitoring\_template.h. + .4 osc\_monitoring\_accumulate.h. + .4 osc\_monitoring\_active\_target.h. + .4 osc\_monitoring\_passive\_target.h. + .4 configure.m4. + .4 Makefile.am. + .2 coll. + .3 monitoring. + .4 coll\_monitoring.h. + .4 coll\_monitoring\_component.c. + .4 coll\_monitoring\_bcast.c. + .4 coll\_monitoring\_reduce.c. + .4 coll\_monitoring\_barrier.c. + .4 coll\_monitoring\_alltoall.c. + .4 {...} . + .4 Makefile.am. + } +\caption{Monitoring component files architecture (partial)} +\label{fig:tree} +\end{figure} + +\subsection{Common} +\label{subsec:TDI-common} +This part of the monitoring components is the place where data is +managed. It centralizes all recorded information, the translation +hash-table and ensures a unique initialization of the monitoring +structures. This component is also the one where the MCA variables (to +be set as part of the command line) are defined and where the final +output, if any requested, is dealt with. + +The header file defines the unique monitoring version number, +different preprocessing macros for printing information using the +monitoring output stream object, and the ompi monitoring API (i.e. the +API to be used INSIDE the ompi software stack, not the one to be +exposed to the end-user). It has to be noted that the {\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}record\brkunds{}*} +functions are to be used with the destination rank translated into the +corresponding rank in {\tt MPI\brkunds{}COMM\brkunds{}WORLD}. This +translation is done by using {\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}get\brkunds{}world\brkunds{}rank}. The +use of this function may be limited by how the initialization occurred +(see in~\ref{subsec:TDI-pml}). + +\subsubsection{Common monitoring} +\label{subsubsec:TDI-common-monitoring} + +The the common\brkunds{}monitoring.c file defines multiples variables +that has the following use: +\begin{description} +\item[{\tt mca\brkunds{}common\brkunds{}monitoring\brkunds{}hold}] is + the counter that keeps tracks of whether the common component has + already been initialized or if it is to be released. The operations + on this variable are atomic to avoid race conditions in a + multi-threaded environment. +\item[{\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}output\brkunds{}stream\brkunds{}obj}] + is the structure used internally by \ompi{} for output streams. The + monitoring output stream states that this output is for debug, so + the actual output will only happen when OPAL is configured with {\tt + -{}-enable-debug}. The output is sent to stderr standard output + stream. 
The prefix field, initialized in {\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}init}, states + that every log message emitted from this stream object will be + prefixed by ``{\tt [hostname:PID] monitoring: }'', where {\tt + hostname} is the configured name of the machine running the + process and {\tt PID} is the process id, with 6 digits, prefixed + with zeros if needed. +\item[{\tt mca\brkunds{}common\brkunds{}monitoring\brkunds{}enabled}] + is the variable retaining the original value given to the MCA option + system, as an example as part of the command line. The corresponding + variable is {\tt pml\brkunds{}monitoring\brkunds{}enable}. This + variable is not to be written by the monitoring component. It is + used to reset the {\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}current\brkunds{}state} + variable between phases. The value given to this parameter also + defines whether or not the filtering between internal and externals + messages is enabled. +\item[{\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}current\brkunds{}state}] + is the variable used to determine the actual current state of the + monitoring. This variable is the one used to define phases. +\item[{\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}output\brkunds{}enabled}] + is a variable, set by the MCA engine, that states whether or not the + user requested a summary of the monitored data to be streamed out at + the end of the execution. It also states whether the output should + be to stdout, stderr or to a file. If a file is requested, the next + two variables have to be set. The corresponding variable is {\tt + pml\brkunds{}monitoring\brkunds{}enable\brkunds{}output}. {\bf + Warning:} This variable may be set to 0 in case the monitoring is + also controlled with \mpit{}. We cannot both control the monitoring + via \mpit{} and expect accurate answer upon {\tt + MPI\brkunds{}Finalize}. +\item[{\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}initial\brkunds{}filename}] + works the same as {\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}ena\allowbreak{}bled}. This + variable is, and has to be, only used as a placeholder for the {\tt + pml\brkunds{}monitoring\allowbreak\brkunds{}filename} + variable. This variable has to be handled very carefully as it has + to live as long as the program and it has to be a valid pointer + address, which content is not to be released by the component. The + way MCA handles variable (especially strings) makes it very easy to + create segmentation faults. But it deals with the memory release of + the content. So, in the end, {\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}initial\brkunds{}filename} + is just to be read. +\item[{\tt + mca\brkunds{}common\brkunds{}monitoring\brkunds{}current\brkunds{}filename}] + is the variable the monitoring component will work with. This + variable is the one to be set by \mpit{'s} control variable {\tt + pml\brkunds{}monitoring\brkunds{}flush}. Even though this control + variable is prefixed with {\tt pml} for historical and easy reasons, + it depends on the common section for its behavior. +\item[{\tt pml\brkunds{}data} and {\tt pml\brkunds{}count}] arrays of + unsigned 64-bits integers record respectively the cumulated amount + of bytes sent from the current process to another process $p$, and + the count of messages. The data in this array at the index $i$ + corresponds to the data sent to the process $p$, of id $i$ in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD}. 
These arrays are of size $N$, + where $N$ is the number of nodes in the MPI application. If the + filtering is disabled, these variables gather all information + regardless of the tags. In this case, the next two arrays are, + obviously, not used, even though they will still be allocated. The + {\tt pml\brkunds{}data} and {\tt pml\brkunds{}count} arrays, and the + nine next arrays described, are allocated, initialized, reset and + freed all at once, and are concurrent in the memory. +\item[{\tt filtered\brkunds{}pml\brkunds{}data} and {\tt + filtered\brkunds{}pml\brkunds{}count}] arrays of unsigned 64-bits + integers record respectively the cumulated amount of bytes sent from + the current process to another process $p$, and the count of + internal messages. The data in this array at the index $i$ + corresponds to the data sent to the process $p$, of id $i$ in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD}. These arrays are of size $N$, + where $N$ is the number of nodes in the MPI application. The + internal messages are defined as messages sent through the PML + layer, with a negative tag. They are issued, as an example, from the + decomposition of collectives operations. +\item[{\tt osc\brkunds{}data\brkunds{}s} and {\tt + osc\brkunds{}count\brkunds{}s}] arrays of unsigned 64-bits + integers record respectively the cumulated amount of bytes sent from + the current process to another process $p$, and the count of + messages. The data in this array at the index $i$ corresponds to the + data sent to the process $p$, of id $i$ in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD}. These arrays are of size $N$, + where $N$ is the number of nodes in the MPI application. +\item[{\tt osc\brkunds{}data\brkunds{}r} and {\tt + osc\brkunds{}count\brkunds{}r}] arrays of unsigned 64-bits + integers record respectively the cumulated amount of bytes received + to the current process to another process $p$, and the count of + messages. The data in this array at the index $i$ corresponds to the + data sent to the process $p$, of id $i$ in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD}. These arrays are of size $N$, + where $N$ is the number of nodes in the MPI application. +\item[{\tt coll\brkunds{}data} and {\tt coll\brkunds{}count}] arrays + of unsigned 64-bits integers record respectively the cumulated + amount of bytes sent from the current process to another process + $p$, in the case of a all-to-all or one-to-all operations, or + received from another process $p$ to the current process, in the + case of all-to-one operations, and the count of messages. The data + in this array at the index $i$ corresponds to the data sent to the + process $p$, of id $i$ in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD}. These arrays are of size $N$, + where $N$ is the number of nodes in the MPI application. The + communications are thus considered symmetrical in the resulting + matrices. +\item[{\tt size\brkunds{}histogram}] array of unsigned 64-bits + integers records the distribution of sizes of pml messages, filtered + or not, between the current process and a process $p$. This + histogram is of log-2 scale. The index 0 is for empty + messages. Messages of size between 1 and $2^{64}$ are recorded such + as the following. For a given size $S$, with $2^k \le S < 2^{k+1}$, + the $k$-th element of the histogram is incremented. This array is of + size $N \times {\tt max\brkunds{}size\brkunds{}histogram}$, where + $N$ is the number of nodes in the MPI application. 
+\item[{\tt max\brkunds{}size\brkunds{}histogram}] is a constant value
+  corresponding to the number of elements in the {\tt
+    size\brkunds{}histo\allowbreak{}gram} array for each processor. It
+  is stored here to avoid having its value hard-coded here and there
+  in the code. This value is used to compute the total size of the
+  array to be allocated, initialized, reset or freed; that total size
+  equals $(10 + {\tt max\brkunds{}size\brkunds{}histogram}) \times N$,
+  where $N$ corresponds to the number of nodes in the MPI
+  application. This value is also used to compute the index of the
+  histogram of a given process $p$; this index equals $i \times {\tt
+    max\brkunds{}size\brkunds{}histogram}$, where $i$ is $p$'s id in
+  {\tt MPI\brkunds{}COMM\brkunds{}WORLD}.
+\item[{\tt log10\brkunds{}2}] is a cached value for the common
+  logarithm (or decimal logarithm) of 2. This value is used to compute
+  the index at which to increment the histogram. This index $j$, for a
+  message that is not empty, is computed as follows: $j = 1 +
+  \left \lfloor{log_{10}(S)/log_{10}(2)} \right \rfloor$, where
+  $log_{10}$ is the decimal logarithm and $S$ the size of the message.
+\item[{\tt rank\brkunds{}world}] is the cached value of the rank of
+  the current process in {\tt MPI\brkunds{}COMM\brkunds{}WORLD}.
+\item[{\tt nprocs\brkunds{}world}] is the cached value of the size of
+  {\tt MPI\brkunds{}COMM\brkunds{}WORLD}.
+\item[{\tt
+    common\brkunds{}monitoring\brkunds{}translation\brkunds{}ht}] is
+  the hash table used to translate the rank $r$ of any process $p$ in
+  any communicator into its rank in {\tt
+    MPI\brkunds{}COMM\brkunds{}WORLD}. It lives as long as the
+  monitoring components do.
+\end{description}
+
+In any case, we never monitor communications between one process and
+itself.
+
+The different functions to access \mpit{} performance variables are
+pretty straightforward. Note that for the PML, OSC and COLL
+performance variables, for both count and size, the {\it notify}
+function is the same. At binding, it sets the {\tt count} variable to
+the size of {\tt MPI\brkunds{}COMM\brkunds{}WORLD}, as requested by
+the MPI-3 standard (for arrays, the parameter should be set to the
+number of elements of the array). Also, the {\it notify} function is
+responsible for starting the monitoring when any monitoring
+performance variable handle is started, and it also disables the
+monitoring when any monitoring performance variable handle is
+stopped. The {\it flush} control variable behaves as follows. On
+binding, it returns the size of the filename defined if any, 0
+otherwise. On the start event, this variable also enables the
+monitoring, as the performance variables do, but it also disables the
+final output, even though it was previously requested by the
+end-user. On the stop event, this variable flushes the monitored data
+to the proper output stream (i.e. stdout, stderr or the requested
+file). Note that these variables are to be bound only with the {\tt
+  MPI\brkunds{}COMM\brkunds{}WORLD} communicator. So far, the behavior
+in case of a binding to another communicator has not been tested.
+
+The flushing itself is decomposed into two functions. The first one
+({\tt mca\brkunds{}common\brkunds{}monitoring\brkunds{}flush}) is
+responsible for opening the proper stream. If it is given 0 as its
+first parameter, it does nothing and no error is propagated, as this
+corresponds to disabled monitoring. The {\tt filename} parameter is
+only taken into account if {\tt fd} is strictly greater than 2.
Note +that upon flushing, the record arrays are reset to 0. Also, the +flushing called in {\it common\brkunds{}monitoring.c} call the +specific flushing for per communicator collectives monitoring data. + +For historical reasons, and because of the fact that the PML layer is +the first one to be loaded, MCA parameters and the {\it + monitoring\brkunds{}flush} control variable are linked to the PML +framework. The other performance variables, though, are linked to the +proper frameworks. + +\subsubsection{Common Coll Monitoring} +\label{subsubsec:TDI-common-coll} + +In addition to the monitored data kept in the arrays, the monitoring +component also provide a per communicator set of records. It keeps +pieces of information about collective operations. As we cannot know +how the data are indeed exchanged (see Section~\ref{subsec:TDI-coll}), +we added this complement to the final summary of the monitored +operations. + +We keep the per communicator data set as part of the {\it + coll\brkunds{}monitoring\brkunds{}module}. Each data set is also +kept in a hash table, with the communicator structure address as the +hash-key. This data set is made to keep tracks of the mount of data +sent through a communicator with collective operations and the count +of each kind of operations. It also cache the list of the processes' +ranks, translated to their rank in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD}, as a string, the rank of the +current process, translated into its rank in {\tt + MPI\brkunds{}COMM\brkunds{}WORLD} and the communicator's name. + +The process list is generated with the following algorithm. First, we +allocate a string long enough to contain it. We define long enough as +$1 + (d + 2) \times s$, where $d$ is the number of digit of the higher +rank in {\tt MPI\brkunds{}COMM\brkunds{}WORLD} and $s$ the size of the +current communicator. We add 2 to $d$, to consider the space needed +for the comma and the space between each rank, and 1 to ensure there +is enough room for the NULL character terminating the string. Then, we +fill the string with the proper values, and adjust the final size of +the string. + +When possible, this process happen when the communicator is being +created. If it fails, this process will be tested again when the +communicator is being released. + +This data set lifetime is different from the one of its corresponding +communicator. It is actually destroyed only once its data had been +flushed (at the end of the execution or at the end of a monitoring +phase). To this end, this structure keeps a flag to know if it is safe +to release it or not. + +\subsection{PML} +\label{subsec:TDI-pml} + +As specified in Section~\ref{subsubsec:TDI-common-monitoring}, this +component is closely working with the common component. They were +merged initially, but separated later in order to propose a cleaner +and more logical architecture. + +This module is the first one to be initialized by the \ompi{} software +stack ; thus it is the one responsible for the proper initialization, +as an example, of the translation hash table. \ompi{} relies on the +PML layer to add process logical structures as far as communicators +are concerned. + +To this end, and because of the way the PML layer is managed by the +MCA engine, this component has some specific variables to manage its +own state, in order to be properly instantiated. The module selection +process works as follows. All the PML modules available for the +framework are loaded, initialized and asked for a priority. 
The higher the priority, the higher the odds of being selected. This
+is why our component returns a priority of 0. Note that the priority
+is returned, and the initialization of the common module is done, at
+this point only if the monitoring has been requested by the user.
+
+% CF - TODO: check what happens if the monitoring is the only PML module available.
+If everything works properly, we should not be selected. The next step
+in the PML initialization is to finalize every module that is not the
+selected one, and then close components that were not used. At this
+point the winner component and its module are saved for the PML. The
+variables {\tt
+  mca\brkunds{}pml\brkunds{}base\brkunds{}selected\brkunds{}component}
+and {\tt mca\brkunds{}pml}, defined in {\it
+  ompi/mca/pml/base/pml\brkunds{}base\brkunds{}frame.c}, are now
+initialized. This point is the one where we install our interception
+layer. We also mark ourselves as initialized, in order to know, on the
+next call to the {\it component\brkunds{}close} function, that we
+actually have to be closed this time. Note that adding our layer
+requires adding the {\tt
+  MCA\brkunds{}PML\brkunds{}BASE\brkunds{}FLAG\brkunds{}REQUIRE\brkunds{}WORLD}
+flag in order to request the whole list of processes to be given at
+the initialization of {\tt MPI\brkunds{}COMM\brkunds{}WORLD}, so we
+can properly fill our hash table. The downside of this trick is that
+it disables the \ompi{} optimization of adding processes lazily.
+
+Once that is done, we are properly installed, and we can monitor every
+message going through the PML layer. As we only monitor messages from
+the emitter side, we only actually record messages sent with the {\tt
+  MPI\brkunds{}Send}, {\tt MPI\brkunds{}Isend} or {\tt
+  MPI\brkunds{}Start} functions.
+
+\subsection{OSC}
+\label{subsec:TDI-osc}
+
+This layer is responsible for remote memory access operations, and
+thus it has its own specificities. Even though the component selection
+process is quite close to the PML one, some aspects of the usage of
+OSC modules forced us to adapt the interception layer.
+
+The first problem comes from how the module is accessed inside the
+components. In the OSC layer, the module is part of the {\tt
+  ompi\brkunds{}win\brkunds{}t} structure. This implies that it is
+possible to directly access the proper field of the structure to find
+the reference to the module, and that is how it is done. Because of
+that, it is not possible to directly replace a module with one of ours
+that would have saved the original module. The first solution was then
+to ``extend'' the module (in the ompi manner of extending {\it
+  objects}) with a structure whose first field would have contained a
+union type of every possible module. We would then have copied their
+field values, saved their functions, and replaced them with pointers
+to our interception functions. This solution was implemented but a
+second problem was encountered that stopped us from going with it.
+
+The second problem was that the {\it osc/rdma} component internally
+uses a hash table to keep track of its modules and allocated segments,
+with the module's pointer address as the hash key. Hence, it was not
+possible for us to modify this address, as the RDMA module would not
+be able to find the corresponding segments. This also implies that it
+is not possible for us to extend the structures either. Therefore, we
+could only modify the common fields of the structures to keep our
+``module'' adapted to any OSC component.
+We designed templates, dynamically adapted for each kind of module.
+
+To this end and for each kind of OSC module, we generate and
+instantiate three variables:
+\begin{description}
+\item[{\tt
+    OMPI\brkunds{}OSC\brkunds{}MONITORING\brkunds{}MODULE\brkunds{}VARIABLE(template)}]
+  is the structure that keeps the address of the original module
+  functions of a given component type (i.e. RDMA, PORTALS4, PT2PT or
+  SM). It is initialized once, and referred to in order to propagate
+  the calls after the initial interception. There is one generated for
+  each kind of OSC component.
+\item[{\tt
+    OMPI\brkunds{}OSC\brkunds{}MONITORING\brkunds{}MODULE\brkunds{}INIT(template)}]
+  is a flag to ensure the module variable is only initialized once, in
+  order to avoid race conditions. There is one generated for each {\tt
+    OMPI\brkunds{}OSC\brkunds{}MONITORING\brkunds{}MODULE\brkunds{}VARIABLE(template)},
+  thus one per kind of OSC component.
+\item[{\tt
+    OMPI\brkunds{}OSC\brkunds{}MONITORING\brkunds{}TEMPLATE\brkunds{}VARIABLE(template)}]
+  is a structure containing the address of the interception
+  functions. There is one generated for each kind of OSC component.
+\end{description}
+
+The interception is done with the following steps. First, we follow
+the selection process. Our priority is set to {\tt INT\brkunds{}MAX}
+in order to ensure that we are the selected component. Then we do this
+selection ourselves. This gives us the opportunity to modify the
+communication module as needed. If it is the first time a module of
+this kind of component is used, we extract the functions' addresses
+from the given module and save them to the {\tt
+  OMPI\brkunds{}OSC\brkunds{}MONITORING\brkunds{}MODULE\brkunds{}VARIABLE(template)}
+structure, after setting the initialization flag. Then we replace the
+original functions in the module with our interception ones.
+
+To make everything work for each kind of component, the variables are
+generated with the corresponding interception functions. These
+operations are done at compilation time. An issue appeared with the
+use of PORTALS4, which has its symbols propagated only when the cards
+are available on the system. In the header files, where we define the
+template functions and structures, {\it template} refers to the OSC
+component name.
+
+We found two drawbacks to this solution. First, the readability of the
+code is bad. Second, this solution does not automatically adapt to new
+components. If a new component is added, the code in {\it
+  ompi/mca/osc/monitoring/osc\brkunds{}monitoring\brkunds{}component.c}
+needs to be modified in order to monitor the operations going through
+it. Even though the modification is three lines long, it may be
+preferred to have the monitoring working without any modification
+related to other components.
+
+A second solution for the OSC monitoring could have been the use of a
+hash table. We would have saved in the hash table the structure
+containing the original functions' addresses, with the module address
+as the hash key. Our interception functions would then have searched
+the hash table for the corresponding structure on every call, in order
+to propagate the function calls. This solution was not implemented
+because it has a higher memory footprint when a large number of
+windows is allocated. Also, the cost of our interceptions would then
+have been higher, because of the search in the hash table. This was
+the main reason we chose the first solution. The OSC layer is designed
+to be very cost-effective in order to take the best advantage of
+background communication and communication/computation overlap. This
+solution would, however, have given us the adaptability our solution
+lacks.
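+
+As an illustration only, the following simplified sketch shows the
+general shape of this template-based interception for a single
+intercepted function. The names used here ({\tt osc\_module\_t}, {\tt
+  OSC\_MONITORING\_GENERATE}, {\tt record\_osc\_send}) are
+hypothetical and do not match the actual \ompi{} macros; in
+particular, the real code guards the one-time initialization with
+opal atomics and records through the {\tt
+  mca\brkunds{}common\brkunds{}monitoring\brkunds{}record\brkunds{}*}
+functions.
+\begin{verbatim}
+#include <stddef.h>
+
+/* Hypothetical minimal "module" holding one RMA entry point. */
+typedef struct osc_module {
+    int (*put)(const void *buf, size_t size, int target,
+               struct osc_module *module);
+} osc_module_t;
+
+/* Hypothetical recording hook (rank translation omitted). */
+void record_osc_send(int world_rank, size_t size);
+
+/* Per component kind: storage for the original function, a
+ * one-time-init flag and the interception functions. */
+#define OSC_MONITORING_GENERATE(template)                             \
+    static int (*template##_orig_put)(const void *, size_t, int,      \
+                                      osc_module_t *);                \
+    static int template##_init_done; /* real code uses atomics */     \
+                                                                      \
+    static int template##_monitoring_put(const void *buf,             \
+                                         size_t size, int target,     \
+                                         osc_module_t *m)             \
+    {                                                                 \
+        record_osc_send(target, size); /* record, then forward */     \
+        return template##_orig_put(buf, size, target, m);             \
+    }                                                                 \
+                                                                      \
+    static void template##_monitoring_install(osc_module_t *m)        \
+    {                                                                 \
+        if (!template##_init_done) { /* save originals only once */   \
+            template##_orig_put = m->put;                             \
+            template##_init_done = 1;                                 \
+        }                                                             \
+        m->put = template##_monitoring_put; /* swap in place */       \
+    }
+
+OSC_MONITORING_GENERATE(rdma) /* one instantiation per OSC kind */
+\end{verbatim}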
+
+\subsection{COLL}
+\label{subsec:TDI-coll}
+
+The collective module (or, to be closer to reality, {\it modules}) is
+part of the communicator. The module selection is made with the
+following algorithm. First, all available components are selected,
+queried and sorted in ascending order of priority. A module may
+provide some or all of the operations, keeping in mind that a module
+with a higher priority may replace it for any given operation. The
+sorted list of modules is iterated over, and for each module, for each
+operation, if the function's address is not {\tt NULL}, the previous
+module is replaced with the current one, and so is the corresponding
+function. Every time a module is selected it is retained and enabled
+(i.e. the {\tt coll\brkunds{}module\brkunds{}enable} function is
+called), and every time it gets replaced, it is disabled (i.e. the
+{\tt coll\brkunds{}module\brkunds{}disable} function is called) and
+released.
+
+When the monitoring module is queried, the priority returned is {\tt
+  INT\brkunds{}MAX} to ensure that our module comes last in the
+list. Then, when enabled, all the previous function--module couples
+are kept as part of our monitoring module. The underlying modules are
+retained to avoid having them freed when they are released by the
+selection process. To preserve the error detection in the communicator
+(i.e. an incomplete collective API), if no module is given for a given
+operation, we set the corresponding function's address to {\tt
+  NULL}. Symmetrically, when our module is released, we propagate this
+call to each underlying module, and we also release the
+objects. Finally, when the module is enabled, we initialize the
+per-communicator data record, which gets released when the module is
+disabled.
+
+When a collective operation is called, whether blocking or
+non-blocking, we intercept the call and record the data in two
+different entries. The operations are grouped into three kinds:
+one-to-all operations, all-to-one operations and all-to-all
+operations.
+
+For one-to-all operations, the root process of the operation computes
+the total amount of data to be sent, and keeps it as part of the
+per-communicator data (see
+Section~\ref{subsubsec:TDI-common-coll}). Then it updates the {\it
+  common\brkunds{}monitoring} array with the amount of data each peer
+has to receive in the end. As we cannot predict the actual algorithm
+used to communicate the data, we assume the root sends everything
+directly to each process.
+
+For all-to-one operations, each non-root process computes the amount
+of data to send to the root and updates the {\it
+  common\brkunds{}monitoring} array with this amount at the index $i$,
+with $i$ being the rank in {\tt MPI\brkunds{}COMM\brkunds{}WORLD} of
+the root process. As we cannot predict the actual algorithm used to
+communicate the data, we assume each process sends its data directly
+to the root. The root process computes the total amount of data to
+receive and updates the per-communicator data.
+
+For all-to-all operations, each process computes, for every other
+process, the amount of data both to send to it and to receive from it.
The amount of +data to be sent to each process $p$ is added to update the {\it + common\brkunds{}monitoring} array at the index $i$, with $i$ being +the rank of $p$ in {\tt MPI\brkunds{}COMM\brkunds{}WORLD}. The total +amount of data sent by a process is also added to the per communicator +data. + +For every rank translation, we use the {\tt + common\brkunds{}monitoring\brkunds{}translation\brkunds{}ht} hash +table. + +\end{document} diff --git a/ompi/mca/common/monitoring/Makefile.am b/ompi/mca/common/monitoring/Makefile.am new file mode 100644 index 00000000000..b857feecf8a --- /dev/null +++ b/ompi/mca/common/monitoring/Makefile.am @@ -0,0 +1,50 @@ +# +# Copyright (c) 2016 Inria. All rights reserved. +# Copyright (c) 2017 Research Organization for Information Science +# and Technology (RIST). All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +sources = common_monitoring.c common_monitoring_coll.c +headers = common_monitoring.h common_monitoring_coll.h + +lib_LTLIBRARIES = +noinst_LTLIBRARIES = +component_install = libmca_common_monitoring.la +component_noinst = libmca_common_monitoring_noinst.la + +if MCA_BUILD_ompi_common_monitoring_DSO +lib_LTLIBRARIES += $(component_install) +else +noinst_LTLIBRARIES += $(component_noinst) +endif + +libmca_common_monitoring_la_SOURCES = $(headers) $(sources) +libmca_common_monitoring_la_CPPFLAGS = $(common_monitoring_CPPFLAGS) +libmca_common_monitoring_la_LDFLAGS = \ + $(common_monitoring_LDFLAGS) +libmca_common_monitoring_la_LIBADD = $(common_monitoring_LIBS) +libmca_common_monitoring_noinst_la_SOURCES = $(headers) $(sources) + +# These two rules will sym link the "noinst" libtool library filename +# to the installable libtool library filename in the case where we are +# compiling this component statically (case 2), described above). +V=0 +OMPI_V_LN_SCOMP = $(ompi__v_LN_SCOMP_$V) +ompi__v_LN_SCOMP_ = $(ompi__v_LN_SCOMP_$AM_DEFAULT_VERBOSITY) +ompi__v_LN_SCOMP_0 = @echo " LN_S " `basename $(component_install)`; + +all-local: + $(OMPI_V_LN_SCOMP) if test -z "$(lib_LTLIBRARIES)"; then \ + rm -f "$(component_install)"; \ + $(LN_S) "$(component_noinst)" "$(component_install)"; \ + fi + +clean-local: + if test -z "$(lib_LTLIBRARIES)"; then \ + rm -f "$(component_install)"; \ + fi diff --git a/ompi/mca/pml/monitoring/README b/ompi/mca/common/monitoring/README similarity index 100% rename from ompi/mca/pml/monitoring/README rename to ompi/mca/common/monitoring/README diff --git a/ompi/mca/common/monitoring/common_monitoring.c b/ompi/mca/common/monitoring/common_monitoring.c new file mode 100644 index 00000000000..68d8c8ab5be --- /dev/null +++ b/ompi/mca/common/monitoring/common_monitoring.c @@ -0,0 +1,795 @@ +/* + * Copyright (c) 2013-2017 The University of Tennessee and The University + * of Tennessee Research Foundation. All rights + * reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. + * Copyright (c) 2015 Bull SAS. All rights reserved. + * Copyright (c) 2016-2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include "common_monitoring.h" +#include "common_monitoring_coll.h" +#include +#include +#include +#include +#include +#include + +#if SIZEOF_LONG_LONG == SIZEOF_SIZE_T +#define MCA_MONITORING_VAR_TYPE MCA_BASE_VAR_TYPE_UNSIGNED_LONG_LONG +#elif SIZEOF_LONG == SIZEOF_SIZE_T +#define MCA_MONITORING_VAR_TYPE MCA_BASE_VAR_TYPE_UNSIGNED_LONG +#endif + +/*** Monitoring specific variables ***/ +/* Keep tracks of how many components are currently using the common part */ +static int32_t mca_common_monitoring_hold = 0; +/* Output parameters */ +int mca_common_monitoring_output_stream_id = -1; +static opal_output_stream_t mca_common_monitoring_output_stream_obj = { + .lds_verbose_level = 0, + .lds_want_syslog = false, + .lds_prefix = NULL, + .lds_suffix = NULL, + .lds_is_debugging = true, + .lds_want_stdout = false, + .lds_want_stderr = true, + .lds_want_file = false, + .lds_want_file_append = false, + .lds_file_suffix = NULL +}; + +/*** MCA params to mark the monitoring as enabled. ***/ +/* This signals that the monitoring will highjack the PML, OSC and COLL */ +int mca_common_monitoring_enabled = 0; +int mca_common_monitoring_current_state = 0; +/* Signals there will be an output of the monitored data at component close */ +static int mca_common_monitoring_output_enabled = 0; +/* File where to output the monitored data */ +static char* mca_common_monitoring_initial_filename = ""; +static char* mca_common_monitoring_current_filename = NULL; + +/* array for stroring monitoring data*/ +static size_t* pml_data = NULL; +static size_t* pml_count = NULL; +static size_t* filtered_pml_data = NULL; +static size_t* filtered_pml_count = NULL; +static size_t* osc_data_s = NULL; +static size_t* osc_count_s = NULL; +static size_t* osc_data_r = NULL; +static size_t* osc_count_r = NULL; +static size_t* coll_data = NULL; +static size_t* coll_count = NULL; + +static size_t* size_histogram = NULL; +static const int max_size_histogram = 66; +static double log10_2 = 0.; + +static int rank_world = -1; +static int nprocs_world = 0; + +opal_hash_table_t *common_monitoring_translation_ht = NULL; + +/* Reset all the monitoring arrays */ +static void mca_common_monitoring_reset ( void ); + +/* Flushes the monitored data and reset the values */ +static int mca_common_monitoring_flush (int fd, char* filename); + +/* Retreive the PML recorded count of messages sent */ +static int mca_common_monitoring_get_pml_count (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the PML recorded amount of data sent */ +static int mca_common_monitoring_get_pml_size (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the OSC recorded count of messages sent */ +static int mca_common_monitoring_get_osc_sent_count (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the OSC recorded amount of data sent */ +static int mca_common_monitoring_get_osc_sent_size (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the OSC recorded count of messages received */ +static int mca_common_monitoring_get_osc_recv_count (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the OSC recorded amount of data received */ +static int mca_common_monitoring_get_osc_recv_size (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the COLL recorded count of messages sent */ +static 
int mca_common_monitoring_get_coll_count (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Retreive the COLL recorded amount of data sent */ +static int mca_common_monitoring_get_coll_size (const struct mca_base_pvar_t *pvar, + void *value, void *obj_handle); + +/* Set the filename where to output the monitored data */ +static int mca_common_monitoring_set_flush(struct mca_base_pvar_t *pvar, + const void *value, void *obj); + +/* Does nothing, as the pml_monitoring_flush pvar has no point to be read */ +static int mca_common_monitoring_get_flush(const struct mca_base_pvar_t *pvar, + void *value, void *obj); + +/* pml_monitoring_count, pml_monitoring_size, + osc_monitoring_sent_count, osc_monitoring sent_size, + osc_monitoring_recv_size and osc_monitoring_recv_count pvar notify + function */ +static int mca_common_monitoring_comm_size_notify(mca_base_pvar_t *pvar, + mca_base_pvar_event_t event, + void *obj_handle, int *count); + +/* pml_monitoring_flush pvar notify function */ +static int mca_common_monitoring_notify_flush(struct mca_base_pvar_t *pvar, + mca_base_pvar_event_t event, + void *obj, int *count); + +static int mca_common_monitoring_set_flush(struct mca_base_pvar_t *pvar, + const void *value, void *obj) +{ + if( NULL != mca_common_monitoring_current_filename ) { + free(mca_common_monitoring_current_filename); + } + if( NULL == *(char**)value || 0 == strlen((char*)value) ) { /* No more output */ + mca_common_monitoring_current_filename = NULL; + } else { + mca_common_monitoring_current_filename = strdup((char*)value); + if( NULL == mca_common_monitoring_current_filename ) + return OMPI_ERROR; + } + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_get_flush(const struct mca_base_pvar_t *pvar, + void *value, void *obj) +{ + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_notify_flush(struct mca_base_pvar_t *pvar, + mca_base_pvar_event_t event, + void *obj, int *count) +{ + switch (event) { + case MCA_BASE_PVAR_HANDLE_BIND: + mca_common_monitoring_reset(); + *count = (NULL == mca_common_monitoring_current_filename + ? 0 : strlen(mca_common_monitoring_current_filename)); + case MCA_BASE_PVAR_HANDLE_UNBIND: + return OMPI_SUCCESS; + case MCA_BASE_PVAR_HANDLE_START: + mca_common_monitoring_current_state = mca_common_monitoring_enabled; + mca_common_monitoring_output_enabled = 0; /* we can't control the monitoring via MPIT and + * expect accurate answer upon MPI_Finalize. 
*/ + return OMPI_SUCCESS; + case MCA_BASE_PVAR_HANDLE_STOP: + return mca_common_monitoring_flush(3, mca_common_monitoring_current_filename); + } + return OMPI_ERROR; +} + +static int mca_common_monitoring_comm_size_notify(mca_base_pvar_t *pvar, + mca_base_pvar_event_t event, + void *obj_handle, + int *count) +{ + switch (event) { + case MCA_BASE_PVAR_HANDLE_BIND: + /* Return the size of the communicator as the number of values */ + *count = ompi_comm_size ((ompi_communicator_t *) obj_handle); + case MCA_BASE_PVAR_HANDLE_UNBIND: + return OMPI_SUCCESS; + case MCA_BASE_PVAR_HANDLE_START: + mca_common_monitoring_current_state = mca_common_monitoring_enabled; + return OMPI_SUCCESS; + case MCA_BASE_PVAR_HANDLE_STOP: + mca_common_monitoring_current_state = 0; + return OMPI_SUCCESS; + } + + return OMPI_ERROR; +} + +void mca_common_monitoring_init( void ) +{ + if( mca_common_monitoring_enabled && + 1 < opal_atomic_add_32(&mca_common_monitoring_hold, 1) ) return; /* Already initialized */ + + char hostname[OPAL_MAXHOSTNAMELEN] = "NA"; + /* Initialize constant */ + log10_2 = log10(2.); + /* Open the opal_output stream */ + gethostname(hostname, sizeof(hostname)); + asprintf(&mca_common_monitoring_output_stream_obj.lds_prefix, + "[%s:%06d] monitoring: ", hostname, getpid()); + mca_common_monitoring_output_stream_id = + opal_output_open(&mca_common_monitoring_output_stream_obj); + /* Initialize proc translation hashtable */ + common_monitoring_translation_ht = OBJ_NEW(opal_hash_table_t); + opal_hash_table_init(common_monitoring_translation_ht, 2048); +} + +void mca_common_monitoring_finalize( void ) +{ + if( ! mca_common_monitoring_enabled || /* Don't release if not last */ + 0 < opal_atomic_sub_32(&mca_common_monitoring_hold, 1) ) return; + + OPAL_MONITORING_PRINT_INFO("common_component_finish"); + /* Dump monitoring informations */ + mca_common_monitoring_flush(mca_common_monitoring_output_enabled, + mca_common_monitoring_current_filename); + /* Disable all monitoring */ + mca_common_monitoring_enabled = 0; + /* Close the opal_output stream */ + opal_output_close(mca_common_monitoring_output_stream_id); + free(mca_common_monitoring_output_stream_obj.lds_prefix); + /* Free internal data structure */ + free(pml_data); /* a single allocation */ + opal_hash_table_remove_all( common_monitoring_translation_ht ); + OBJ_RELEASE(common_monitoring_translation_ht); + mca_common_monitoring_coll_finalize(); + if( NULL != mca_common_monitoring_current_filename ) { + free(mca_common_monitoring_current_filename); + mca_common_monitoring_current_filename = NULL; + } +} + +void mca_common_monitoring_register(void*pml_monitoring_component) +{ + /* Because we are playing tricks with the component close, we should not + * use mca_base_component_var_register but instead stay with the basic + * version mca_base_var_register. + */ + (void)mca_base_var_register("ompi", "pml", "monitoring", "enable", + "Enable the monitoring at the PML level. A value of 0 " + "will disable the monitoring (default). A value of 1 will " + "aggregate all monitoring information (point-to-point and " + "collective). 
Any other value will enable filtered monitoring", + MCA_BASE_VAR_TYPE_INT, NULL, MPI_T_BIND_NO_OBJECT, + MCA_BASE_VAR_FLAG_DWG, OPAL_INFO_LVL_4, + MCA_BASE_VAR_SCOPE_READONLY, + &mca_common_monitoring_enabled); + + mca_common_monitoring_current_state = mca_common_monitoring_enabled; + + (void)mca_base_var_register("ompi", "pml", "monitoring", "enable_output", + "Enable the PML monitoring textual output at MPI_Finalize " + "(it will be automatically turned off when MPIT is used to " + "monitor communications). This value should be different " + "than 0 in order for the output to be enabled (default disable)", + MCA_BASE_VAR_TYPE_INT, NULL, MPI_T_BIND_NO_OBJECT, + MCA_BASE_VAR_FLAG_DWG, OPAL_INFO_LVL_9, + MCA_BASE_VAR_SCOPE_READONLY, + &mca_common_monitoring_output_enabled); + + (void)mca_base_var_register("ompi", "pml", "monitoring", "filename", + /*&mca_common_monitoring_component.pmlm_version, "filename",*/ + "The name of the file where the monitoring information " + "should be saved (the filename will be extended with the " + "process rank and the \".prof\" extension). If this field " + "is NULL the monitoring will not be saved.", + MCA_BASE_VAR_TYPE_STRING, NULL, MPI_T_BIND_NO_OBJECT, + MCA_BASE_VAR_FLAG_DWG, OPAL_INFO_LVL_9, + MCA_BASE_VAR_SCOPE_READONLY, + &mca_common_monitoring_initial_filename); + + /* Now that the MCA variables are automatically unregistered when + * their component close, we need to keep a safe copy of the + * filename. + * Keep the copy completely separated in order to let the initial + * filename to be handled by the framework. It's easier to deal + * with the string lifetime. + */ + if( NULL != mca_common_monitoring_initial_filename ) + mca_common_monitoring_current_filename = strdup(mca_common_monitoring_initial_filename); + + /* Register PVARs */ + + /* PML PVARs */ + (void)mca_base_pvar_register("ompi", "pml", "monitoring", "flush", "Flush the monitoring " + "information in the provided file. 
The filename is append with " + "the .%d.prof suffix, where %d is replaced with the processus " + "rank in MPI_COMM_WORLD.", + OPAL_INFO_LVL_1, MCA_BASE_PVAR_CLASS_GENERIC, + MCA_BASE_VAR_TYPE_STRING, NULL, MPI_T_BIND_NO_OBJECT, 0, + mca_common_monitoring_get_flush, mca_common_monitoring_set_flush, + mca_common_monitoring_notify_flush, NULL); + + (void)mca_base_pvar_register("ompi", "pml", "monitoring", "messages_count", "Number of " + "messages sent to each peer through the PML framework.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_pml_count, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + (void)mca_base_pvar_register("ompi", "pml", "monitoring", "messages_size", "Size of messages " + "sent to each peer in a communicator through the PML framework.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_pml_size, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + /* OSC PVARs */ + (void)mca_base_pvar_register("ompi", "osc", "monitoring", "messages_sent_count", "Number of " + "messages sent through the OSC framework with each peer.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_osc_sent_count, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + (void)mca_base_pvar_register("ompi", "osc", "monitoring", "messages_sent_size", "Size of " + "messages sent through the OSC framework with each peer.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_osc_sent_size, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + (void)mca_base_pvar_register("ompi", "osc", "monitoring", "messages_recv_count", "Number of " + "messages received through the OSC framework with each peer.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_osc_recv_count, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + (void)mca_base_pvar_register("ompi", "osc", "monitoring", "messages_recv_size", "Size of " + "messages received through the OSC framework with each peer.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_osc_recv_size, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + /* COLL PVARs */ + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "messages_count", "Number of " + "messages exchanged through the COLL framework with each peer.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_coll_count, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "messages_size", "Size of " + "messages exchanged through the COLL framework with each peer.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_get_coll_size, NULL, + mca_common_monitoring_comm_size_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "o2a_count", "Number of messages " + "exchanged 
as one-to-all operations in a communicator.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_COUNTER, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_coll_get_o2a_count, NULL, + mca_common_monitoring_coll_messages_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "o2a_size", "Size of messages " + "exchanged as one-to-all operations in a communicator.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_AGGREGATE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_coll_get_o2a_size, NULL, + mca_common_monitoring_coll_messages_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "a2o_count", "Number of messages " + "exchanged as all-to-one operations in a communicator.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_COUNTER, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_coll_get_a2o_count, NULL, + mca_common_monitoring_coll_messages_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "a2o_size", "Size of messages " + "exchanged as all-to-one operations in a communicator.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_AGGREGATE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_coll_get_a2o_size, NULL, + mca_common_monitoring_coll_messages_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "a2a_count", "Number of messages " + "exchanged as all-to-all operations in a communicator.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_COUNTER, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_coll_get_a2a_count, NULL, + mca_common_monitoring_coll_messages_notify, NULL); + + (void)mca_base_pvar_register("ompi", "coll", "monitoring", "a2a_size", "Size of messages " + "exchanged as all-to-all operations in a communicator.", + OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_AGGREGATE, + MCA_MONITORING_VAR_TYPE, NULL, MPI_T_BIND_MPI_COMM, + MCA_BASE_PVAR_FLAG_READONLY, + mca_common_monitoring_coll_get_a2a_size, NULL, + mca_common_monitoring_coll_messages_notify, NULL); +} + +/** + * This PML monitors only the processes in the MPI_COMM_WORLD. As OMPI is now lazily + * adding peers on the first call to add_procs we need to check how many processes + * are in the MPI_COMM_WORLD to create the storage with the right size. 
+ */ +int mca_common_monitoring_add_procs(struct ompi_proc_t **procs, + size_t nprocs) +{ + opal_process_name_t tmp, wp_name; + size_t i; + int peer_rank; + uint64_t key; + if( 0 > rank_world ) + rank_world = ompi_comm_rank((ompi_communicator_t*)&ompi_mpi_comm_world); + if( !nprocs_world ) + nprocs_world = ompi_comm_size((ompi_communicator_t*)&ompi_mpi_comm_world); + + if( NULL == pml_data ) { + int array_size = (10 + max_size_histogram) * nprocs_world; + pml_data = (size_t*)calloc(array_size, sizeof(size_t)); + pml_count = pml_data + nprocs_world; + filtered_pml_data = pml_count + nprocs_world; + filtered_pml_count = filtered_pml_data + nprocs_world; + osc_data_s = filtered_pml_count + nprocs_world; + osc_count_s = osc_data_s + nprocs_world; + osc_data_r = osc_count_s + nprocs_world; + osc_count_r = osc_data_r + nprocs_world; + coll_data = osc_count_r + nprocs_world; + coll_count = coll_data + nprocs_world; + + size_histogram = coll_count + nprocs_world; + } + + /* For all procs in the same MPI_COMM_WORLD we need to add them to the hash table */ + for( i = 0; i < nprocs; i++ ) { + + /* Extract the peer procname from the procs array */ + if( ompi_proc_is_sentinel(procs[i]) ) { + tmp = ompi_proc_sentinel_to_name((uintptr_t)procs[i]); + } else { + tmp = procs[i]->super.proc_name; + } + if( tmp.jobid != ompi_proc_local_proc->super.proc_name.jobid ) + continue; + + /* each process will only be added once, so there is no way it already exists in the hash */ + for( peer_rank = 0; peer_rank < nprocs_world; peer_rank++ ) { + wp_name = ompi_group_get_proc_name(((ompi_communicator_t*)&ompi_mpi_comm_world)->c_remote_group, peer_rank); + if( 0 != opal_compare_proc( tmp, wp_name ) ) + continue; + + key = *((uint64_t*)&tmp); + /* save the rank of the process in MPI_COMM_WORLD in the hash using the proc_name as the key */ + if( OPAL_SUCCESS != opal_hash_table_set_value_uint64(common_monitoring_translation_ht, + key, (void*)(uintptr_t)peer_rank) ) { + return OMPI_ERR_OUT_OF_RESOURCE; /* failed to allocate memory or growing the hash table */ + } + break; + } + } + return OMPI_SUCCESS; +} + +static void mca_common_monitoring_reset( void ) +{ + int array_size = (10 + max_size_histogram) * nprocs_world; + memset(pml_data, 0, array_size * sizeof(size_t)); + mca_common_monitoring_coll_reset(); +} + +void mca_common_monitoring_record_pml(int world_rank, size_t data_size, int tag) +{ + if( 0 == mca_common_monitoring_current_state ) return; /* right now the monitoring is not started */ + + /* Keep tracks of the data_size distribution */ + if( 0 == data_size ) { + opal_atomic_add_size_t(&size_histogram[world_rank * max_size_histogram], 1); + } else { + int log2_size = log10(data_size)/log10_2; + if(log2_size > max_size_histogram - 2) /* Avoid out-of-bound write */ + log2_size = max_size_histogram - 2; + opal_atomic_add_size_t(&size_histogram[world_rank * max_size_histogram + log2_size + 1], 1); + } + + /* distinguishses positive and negative tags if requested */ + if( (tag < 0) && (mca_common_monitoring_filter()) ) { + opal_atomic_add_size_t(&filtered_pml_data[world_rank], data_size); + opal_atomic_add_size_t(&filtered_pml_count[world_rank], 1); + } else { /* if filtered monitoring is not activated data is aggregated indifferently */ + opal_atomic_add_size_t(&pml_data[world_rank], data_size); + opal_atomic_add_size_t(&pml_count[world_rank], 1); + } +} + +static int mca_common_monitoring_get_pml_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = 
(ompi_communicator_t *) obj_handle; + int i, comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_count) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = pml_count[i]; + } + + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_get_pml_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + int comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + int i; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_data) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = pml_data[i]; + } + + return OMPI_SUCCESS; +} + +void mca_common_monitoring_record_osc(int world_rank, size_t data_size, + enum mca_monitoring_osc_direction dir) +{ + if( 0 == mca_common_monitoring_current_state ) return; /* right now the monitoring is not started */ + + if( SEND == dir ) { + opal_atomic_add_size_t(&osc_data_s[world_rank], data_size); + opal_atomic_add_size_t(&osc_count_s[world_rank], 1); + } else { + opal_atomic_add_size_t(&osc_data_r[world_rank], data_size); + opal_atomic_add_size_t(&osc_count_r[world_rank], 1); + } +} + +static int mca_common_monitoring_get_osc_sent_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + int i, comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_count) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = osc_count_s[i]; + } + + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_get_osc_sent_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + int comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + int i; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_data) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = osc_data_s[i]; + } + + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_get_osc_recv_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + int i, comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_count) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = osc_count_r[i]; + } + + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_get_osc_recv_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + int comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + int i; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_data) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = osc_data_r[i]; + } + + return OMPI_SUCCESS; +} + +void mca_common_monitoring_record_coll(int world_rank, size_t data_size) +{ + if( 0 == mca_common_monitoring_current_state ) return; /* right now the monitoring is not started */ + + opal_atomic_add_size_t(&coll_data[world_rank], data_size); + opal_atomic_add_size_t(&coll_count[world_rank], 1); +} + +static int mca_common_monitoring_get_coll_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t 
*comm = (ompi_communicator_t *) obj_handle; + int i, comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_count) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = coll_count[i]; + } + + return OMPI_SUCCESS; +} + +static int mca_common_monitoring_get_coll_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + int comm_size = ompi_comm_size (comm); + size_t *values = (size_t*) value; + int i; + + if(comm != &ompi_mpi_comm_world.comm || NULL == pml_data) + return OMPI_ERROR; + + for (i = 0 ; i < comm_size ; ++i) { + values[i] = coll_data[i]; + } + + return OMPI_SUCCESS; +} + +static void mca_common_monitoring_output( FILE *pf, int my_rank, int nbprocs ) +{ + /* Dump outgoing messages */ + fprintf(pf, "# POINT TO POINT\n"); + for (int i = 0 ; i < nbprocs ; i++) { + if(pml_count[i] > 0) { + fprintf(pf, "E\t%" PRId32 "\t%" PRId32 "\t%zu bytes\t%zu msgs sent\t", + my_rank, i, pml_data[i], pml_count[i]); + for(int j = 0 ; j < max_size_histogram ; ++j) + fprintf(pf, "%zu%s", size_histogram[i * max_size_histogram + j], + j < max_size_histogram - 1 ? "," : "\n"); + } + } + + /* Dump outgoing synchronization/collective messages */ + if( mca_common_monitoring_filter() ) { + for (int i = 0 ; i < nbprocs ; i++) { + if(filtered_pml_count[i] > 0) { + fprintf(pf, "I\t%" PRId32 "\t%" PRId32 "\t%zu bytes\t%zu msgs sent%s", + my_rank, i, filtered_pml_data[i], filtered_pml_count[i], + 0 == pml_count[i] ? "\t" : "\n"); + /* + * In the case there was no external messages + * exchanged between the two processes, the histogram + * has not yet been dumpped. Then we need to add it at + * the end of the internal category. + */ + if(0 == pml_count[i]) { + for(int j = 0 ; j < max_size_histogram ; ++j) + fprintf(pf, "%zu%s", size_histogram[i * max_size_histogram + j], + j < max_size_histogram - 1 ? 
"," : "\n"); + } + } + } + } + + /* Dump incoming messages */ + fprintf(pf, "# OSC\n"); + for (int i = 0 ; i < nbprocs ; i++) { + if(osc_count_s[i] > 0) { + fprintf(pf, "S\t%" PRId32 "\t%" PRId32 "\t%zu bytes\t%zu msgs sent\n", + my_rank, i, osc_data_s[i], osc_count_s[i]); + } + if(osc_count_r[i] > 0) { + fprintf(pf, "R\t%" PRId32 "\t%" PRId32 "\t%zu bytes\t%zu msgs sent\n", + my_rank, i, osc_data_r[i], osc_count_r[i]); + } + } + + /* Dump collectives */ + fprintf(pf, "# COLLECTIVES\n"); + for (int i = 0 ; i < nbprocs ; i++) { + if(coll_count[i] > 0) { + fprintf(pf, "C\t%" PRId32 "\t%" PRId32 "\t%zu bytes\t%zu msgs sent\n", + my_rank, i, coll_data[i], coll_count[i]); + } + } + mca_common_monitoring_coll_flush_all(pf); +} + +/* + * Flushes the monitoring into filename + * Useful for phases (see example in test/monitoring) + */ +static int mca_common_monitoring_flush(int fd, char* filename) +{ + /* If we are not drived by MPIT then dump the monitoring information */ + if( 0 == mca_common_monitoring_current_state || 0 == fd ) /* if disabled do nothing */ + return OMPI_SUCCESS; + + if( 1 == fd ) { + OPAL_MONITORING_PRINT_INFO("Proc %" PRId32 " flushing monitoring to stdout", rank_world); + mca_common_monitoring_output( stdout, rank_world, nprocs_world ); + } else if( 2 == fd ) { + OPAL_MONITORING_PRINT_INFO("Proc %" PRId32 " flushing monitoring to stderr", rank_world); + mca_common_monitoring_output( stderr, rank_world, nprocs_world ); + } else { + FILE *pf = NULL; + char* tmpfn = NULL; + + if( NULL == filename ) { /* No filename */ + OPAL_MONITORING_PRINT_ERR("Error while flushing: no filename provided"); + return OMPI_ERROR; + } else { + asprintf(&tmpfn, "%s.%" PRId32 ".prof", filename, rank_world); + pf = fopen(tmpfn, "w"); + free(tmpfn); + } + + if(NULL == pf) { /* Error during open */ + OPAL_MONITORING_PRINT_ERR("Error while flushing to: %s.%" PRId32 ".prof", + filename, rank_world); + return OMPI_ERROR; + } + + OPAL_MONITORING_PRINT_INFO("Proc %d flushing monitoring to: %s.%" PRId32 ".prof", + rank_world, filename, rank_world); + + mca_common_monitoring_output( pf, rank_world, nprocs_world ); + + fclose(pf); + } + /* Reset to 0 all monitored data */ + mca_common_monitoring_reset(); + return OMPI_SUCCESS; +} diff --git a/ompi/mca/common/monitoring/common_monitoring.h b/ompi/mca/common/monitoring/common_monitoring.h new file mode 100644 index 00000000000..6cde893cf13 --- /dev/null +++ b/ompi/mca/common/monitoring/common_monitoring.h @@ -0,0 +1,120 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_COMMON_MONITORING_H +#define MCA_COMMON_MONITORING_H + +BEGIN_C_DECLS + +#include +#include +#include +#include +#include +#include + +#define MCA_MONITORING_MAKE_VERSION \ + MCA_BASE_MAKE_VERSION(component, OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION) + +#define OPAL_MONITORING_VERBOSE(x, ...) \ + OPAL_OUTPUT_VERBOSE((x, mca_common_monitoring_output_stream_id, __VA_ARGS__)) + +/* When built in debug mode, always display error messages */ +#if OPAL_ENABLE_DEBUG +#define OPAL_MONITORING_PRINT_ERR(...) \ + OPAL_MONITORING_VERBOSE(0, __VA_ARGS__) +#else /* if( ! OPAL_ENABLE_DEBUG ) */ +#define OPAL_MONITORING_PRINT_ERR(...) \ + OPAL_MONITORING_VERBOSE(1, __VA_ARGS__) +#endif /* OPAL_ENABLE_DEBUG */ + +#define OPAL_MONITORING_PRINT_WARN(...) \ + OPAL_MONITORING_VERBOSE(5, __VA_ARGS__) + +#define OPAL_MONITORING_PRINT_INFO(...) 
\ + OPAL_MONITORING_VERBOSE(10, __VA_ARGS__) + +extern int mca_common_monitoring_output_stream_id; +extern int mca_common_monitoring_enabled; +extern int mca_common_monitoring_current_state; +extern opal_hash_table_t *common_monitoring_translation_ht; + +OMPI_DECLSPEC void mca_common_monitoring_register(void*pml_monitoring_component); +OMPI_DECLSPEC void mca_common_monitoring_init( void ); +OMPI_DECLSPEC void mca_common_monitoring_finalize( void ); +OMPI_DECLSPEC int mca_common_monitoring_add_procs(struct ompi_proc_t **procs, size_t nprocs); + +/* Records PML communication */ +OMPI_DECLSPEC void mca_common_monitoring_record_pml(int world_rank, size_t data_size, int tag); + +/* SEND corresponds to data emitted from the current proc to the given + * one. RECV represents data emitted from the given proc to the + * current one. + */ +enum mca_monitoring_osc_direction { SEND, RECV }; + +/* Records OSC communications. */ +OMPI_DECLSPEC void mca_common_monitoring_record_osc(int world_rank, size_t data_size, + enum mca_monitoring_osc_direction dir); + +/* Records COLL communications. */ +OMPI_DECLSPEC void mca_common_monitoring_record_coll(int world_rank, size_t data_size); + +/* Translate the rank from the given communicator of a process to its rank in MPI_COMM_RANK. */ +static inline int mca_common_monitoring_get_world_rank(int dst, struct ompi_communicator_t*comm, + int*world_rank) +{ + opal_process_name_t tmp; + + /* find the processor of the destination */ + ompi_proc_t *proc = ompi_group_get_proc_ptr(comm->c_remote_group, dst, true); + if( ompi_proc_is_sentinel(proc) ) { + tmp = ompi_proc_sentinel_to_name((uintptr_t)proc); + } else { + tmp = proc->super.proc_name; + } + + /* find its name*/ + uint64_t rank, key = *((uint64_t*)&tmp); + /** + * If this fails the destination is not part of my MPI_COM_WORLD + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank + */ + int ret = opal_hash_table_get_value_uint64(common_monitoring_translation_ht, + key, (void *)&rank); + + /* Use intermediate variable to avoid overwriting while looking up in the hashtbale. */ + if( ret == OPAL_SUCCESS ) *world_rank = (int)rank; + return ret; +} + +/* Return the current status of the monitoring system 0 if off or the + * seperation between internal tags and external tags is disabled. Any + * other positive value if the segregation between point-to-point and + * collective is enabled. 
+ */ +static inline int mca_common_monitoring_filter( void ) +{ + return 1 < mca_common_monitoring_current_state; +} + +/* Collective operation monitoring */ +struct mca_monitoring_coll_data_t; +typedef struct mca_monitoring_coll_data_t mca_monitoring_coll_data_t; +OMPI_DECLSPEC OBJ_CLASS_DECLARATION(mca_monitoring_coll_data_t); + +OMPI_DECLSPEC mca_monitoring_coll_data_t*mca_common_monitoring_coll_new(ompi_communicator_t*comm); +OMPI_DECLSPEC void mca_common_monitoring_coll_release(mca_monitoring_coll_data_t*data); +OMPI_DECLSPEC void mca_common_monitoring_coll_o2a(size_t size, mca_monitoring_coll_data_t*data); +OMPI_DECLSPEC void mca_common_monitoring_coll_a2o(size_t size, mca_monitoring_coll_data_t*data); +OMPI_DECLSPEC void mca_common_monitoring_coll_a2a(size_t size, mca_monitoring_coll_data_t*data); + +END_C_DECLS + +#endif /* MCA_COMMON_MONITORING_H */ diff --git a/ompi/mca/common/monitoring/common_monitoring_coll.c b/ompi/mca/common/monitoring/common_monitoring_coll.c new file mode 100644 index 00000000000..f16eac09f75 --- /dev/null +++ b/ompi/mca/common/monitoring/common_monitoring_coll.c @@ -0,0 +1,380 @@ +/* + * Copyright (c) 2013-2016 The University of Tennessee and The University + * of Tennessee Research Foundation. All rights + * reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. + * Copyright (c) 2015 Bull SAS. All rights reserved. + * Copyright (c) 2016-2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include "common_monitoring.h" +#include "common_monitoring_coll.h" +#include +#include +#include +#include +#include + +/*** Monitoring specific variables ***/ +struct mca_monitoring_coll_data_t { + opal_object_t super; + char*procs; + char*comm_name; + int world_rank; + int is_released; + ompi_communicator_t*p_comm; + size_t o2a_count; + size_t o2a_size; + size_t a2o_count; + size_t a2o_size; + size_t a2a_count; + size_t a2a_size; +}; + +/* Collectives operation monitoring */ +static opal_hash_table_t *comm_data = NULL; + +/* Check whether the communicator's name have been changed. Update the + * data->comm_name field if so. 
+ */ +static inline void mca_common_monitoring_coll_check_name(mca_monitoring_coll_data_t*data) +{ + if( data->comm_name && data->p_comm && (data->p_comm->c_flags & OMPI_COMM_NAMEISSET) + && data->p_comm->c_name && 0 < strlen(data->p_comm->c_name) + && 0 != strncmp(data->p_comm->c_name, data->comm_name, OPAL_MAX_OBJECT_NAME - 1) ) + { + free(data->comm_name); + data->comm_name = strdup(data->p_comm->c_name); + } +} + +static inline void mca_common_monitoring_coll_cache(mca_monitoring_coll_data_t*data) +{ + int world_rank; + if( NULL == data->comm_name && 0 < strlen(data->p_comm->c_name) ) { + data->comm_name = strdup(data->p_comm->c_name); + } else { + mca_common_monitoring_coll_check_name(data); + } + if( -1 == data->world_rank ) { + /* Get current process world_rank */ + mca_common_monitoring_get_world_rank(ompi_comm_rank(data->p_comm), data->p_comm, + &data->world_rank); + } + /* Only list procs if the hashtable is already initialized, ie if the previous call worked */ + if( (-1 != data->world_rank) && (NULL == data->procs || 0 == strlen(data->procs)) ) { + int i, pos = 0, size, world_size = -1, max_length; + char*tmp_procs; + size = ompi_comm_size(data->p_comm); + world_size = ompi_comm_size((ompi_communicator_t*)&ompi_mpi_comm_world) - 1; + assert( 0 < size ); + /* Allocate enough space for list (add 1 to keep the final '\0' if already exact size) */ + max_length = snprintf(NULL, 0, "%d,", world_size - 1) + 1; + tmp_procs = malloc((1 + max_length * size) * sizeof(char)); + if( NULL == tmp_procs ) { + OPAL_MONITORING_PRINT_ERR("Cannot allocate memory for caching proc list."); + } else { + tmp_procs[0] = '\0'; + /* Build procs list */ + for(i = 0; i < size; ++i) { + mca_common_monitoring_get_world_rank(i, data->p_comm, &world_rank); + pos += sprintf(&tmp_procs[pos], "%d,", world_rank); + } + tmp_procs[pos - 1] = '\0'; /* Remove final coma */ + data->procs = realloc(tmp_procs, pos * sizeof(char)); /* Adjust to size required */ + } + } +} + +mca_monitoring_coll_data_t*mca_common_monitoring_coll_new( ompi_communicator_t*comm ) +{ + mca_monitoring_coll_data_t*data = OBJ_NEW(mca_monitoring_coll_data_t); + if( NULL == data ) { + OPAL_MONITORING_PRINT_ERR("coll: new: data structure cannot be allocated"); + return NULL; + } + + data->p_comm = comm; + + /* Allocate hashtable */ + if( NULL == comm_data ) { + comm_data = OBJ_NEW(opal_hash_table_t); + if( NULL == comm_data ) { + OPAL_MONITORING_PRINT_ERR("coll: new: failed to allocate hashtable"); + return data; + } + opal_hash_table_init(comm_data, 2048); + } + + /* Insert in hashtable */ + uint64_t key = *((uint64_t*)&comm); + if( OPAL_SUCCESS != opal_hash_table_set_value_uint64(comm_data, key, (void*)data) ) { + OPAL_MONITORING_PRINT_ERR("coll: new: failed to allocate memory or " + "growing the hash table"); + } + + /* Cache data so the procs can be released without affecting the output */ + mca_common_monitoring_coll_cache(data); + + return data; +} + +void mca_common_monitoring_coll_release(mca_monitoring_coll_data_t*data) +{ +#if OPAL_ENABLE_DEBUG + if( NULL == data ) { + OPAL_MONITORING_PRINT_ERR("coll: release: data structure empty or already desallocated"); + return; + } +#endif /* OPAL_ENABLE_DEBUG */ + + /* not flushed yet */ + mca_common_monitoring_coll_cache(data); + data->is_released = 1; +} + +static void mca_common_monitoring_coll_cond_release(mca_monitoring_coll_data_t*data) +{ +#if OPAL_ENABLE_DEBUG + if( NULL == data ) { + OPAL_MONITORING_PRINT_ERR("coll: release: data structure empty or already desallocated"); + return; + } 
+#endif /* OPAL_ENABLE_DEBUG */ + + if( data->is_released ) { /* if the communicator is already released */ + opal_hash_table_remove_value_uint64(comm_data, *((uint64_t*)&data->p_comm)); + data->p_comm = NULL; + free(data->comm_name); + free(data->procs); + OBJ_RELEASE(data); + } +} + +void mca_common_monitoring_coll_finalize( void ) +{ + if( NULL != comm_data ) { + opal_hash_table_remove_all( comm_data ); + OBJ_RELEASE(comm_data); + } +} + +void mca_common_monitoring_coll_flush(FILE *pf, mca_monitoring_coll_data_t*data) +{ + /* Check for any change in the communicator's name */ + mca_common_monitoring_coll_check_name(data); + + /* Flush data */ + fprintf(pf, + "D\t%s\tprocs: %s\n" + "O2A\t%" PRId32 "\t%zu bytes\t%zu msgs sent\n" + "A2O\t%" PRId32 "\t%zu bytes\t%zu msgs sent\n" + "A2A\t%" PRId32 "\t%zu bytes\t%zu msgs sent\n", + data->comm_name ? data->comm_name : "(no-name)", data->procs, + data->world_rank, data->o2a_size, data->o2a_count, + data->world_rank, data->a2o_size, data->a2o_count, + data->world_rank, data->a2a_size, data->a2a_count); +} + +void mca_common_monitoring_coll_flush_all(FILE *pf) +{ + if( NULL == comm_data ) return; /* No hashtable */ + + uint64_t key; + mca_monitoring_coll_data_t*previous = NULL, *data; + + OPAL_HASH_TABLE_FOREACH(key, uint64, data, comm_data) { + if( NULL != previous && NULL == previous->p_comm ) { + /* Phase flushed -> free already released once coll_data_t */ + mca_common_monitoring_coll_cond_release(previous); + } + mca_common_monitoring_coll_flush(pf, data); + previous = data; + } + mca_common_monitoring_coll_cond_release(previous); +} + + +void mca_common_monitoring_coll_reset(void) +{ + if( NULL == comm_data ) return; /* No hashtable */ + + uint64_t key; + mca_monitoring_coll_data_t*data; + + OPAL_HASH_TABLE_FOREACH(key, uint64, data, comm_data) { + data->o2a_count = 0; data->o2a_size = 0; + data->a2o_count = 0; data->a2o_size = 0; + data->a2a_count = 0; data->a2a_size = 0; + } +} + +int mca_common_monitoring_coll_messages_notify(mca_base_pvar_t *pvar, + mca_base_pvar_event_t event, + void *obj_handle, + int *count) +{ + switch (event) { + case MCA_BASE_PVAR_HANDLE_BIND: + *count = 1; + case MCA_BASE_PVAR_HANDLE_UNBIND: + return OMPI_SUCCESS; + case MCA_BASE_PVAR_HANDLE_START: + mca_common_monitoring_current_state = mca_common_monitoring_enabled; + return OMPI_SUCCESS; + case MCA_BASE_PVAR_HANDLE_STOP: + mca_common_monitoring_current_state = 0; + return OMPI_SUCCESS; + } + + return OMPI_ERROR; +} + +void mca_common_monitoring_coll_o2a(size_t size, mca_monitoring_coll_data_t*data) +{ + if( 0 == mca_common_monitoring_current_state ) return; /* right now the monitoring is not started */ +#if OPAL_ENABLE_DEBUG + if( NULL == data ) { + OPAL_MONITORING_PRINT_ERR("coll: o2a: data structure empty"); + return; + } +#endif /* OPAL_ENABLE_DEBUG */ + opal_atomic_add_size_t(&data->o2a_size, size); + opal_atomic_add_size_t(&data->o2a_count, 1); +} + +int mca_common_monitoring_coll_get_o2a_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + size_t *value_size = (size_t*) value; + mca_monitoring_coll_data_t*data; + int ret = opal_hash_table_get_value_uint64(comm_data, *((uint64_t*)&comm), (void*)&data); + if( OPAL_SUCCESS == ret ) { + *value_size = data->o2a_count; + } + return ret; +} + +int mca_common_monitoring_coll_get_o2a_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) 
obj_handle; + size_t *value_size = (size_t*) value; + mca_monitoring_coll_data_t*data; + int ret = opal_hash_table_get_value_uint64(comm_data, *((uint64_t*)&comm), (void*)&data); + if( OPAL_SUCCESS == ret ) { + *value_size = data->o2a_size; + } + return ret; +} + +void mca_common_monitoring_coll_a2o(size_t size, mca_monitoring_coll_data_t*data) +{ + if( 0 == mca_common_monitoring_current_state ) return; /* right now the monitoring is not started */ +#if OPAL_ENABLE_DEBUG + if( NULL == data ) { + OPAL_MONITORING_PRINT_ERR("coll: a2o: data structure empty"); + return; + } +#endif /* OPAL_ENABLE_DEBUG */ + opal_atomic_add_size_t(&data->a2o_size, size); + opal_atomic_add_size_t(&data->a2o_count, 1); +} + +int mca_common_monitoring_coll_get_a2o_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + size_t *value_size = (size_t*) value; + mca_monitoring_coll_data_t*data; + int ret = opal_hash_table_get_value_uint64(comm_data, *((uint64_t*)&comm), (void*)&data); + if( OPAL_SUCCESS == ret ) { + *value_size = data->a2o_count; + } + return ret; +} + +int mca_common_monitoring_coll_get_a2o_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + size_t *value_size = (size_t*) value; + mca_monitoring_coll_data_t*data; + int ret = opal_hash_table_get_value_uint64(comm_data, *((uint64_t*)&comm), (void*)&data); + if( OPAL_SUCCESS == ret ) { + *value_size = data->a2o_size; + } + return ret; +} + +void mca_common_monitoring_coll_a2a(size_t size, mca_monitoring_coll_data_t*data) +{ + if( 0 == mca_common_monitoring_current_state ) return; /* right now the monitoring is not started */ +#if OPAL_ENABLE_DEBUG + if( NULL == data ) { + OPAL_MONITORING_PRINT_ERR("coll: a2a: data structure empty"); + return; + } +#endif /* OPAL_ENABLE_DEBUG */ + opal_atomic_add_size_t(&data->a2a_size, size); + opal_atomic_add_size_t(&data->a2a_count, 1); +} + +int mca_common_monitoring_coll_get_a2a_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + size_t *value_size = (size_t*) value; + mca_monitoring_coll_data_t*data; + int ret = opal_hash_table_get_value_uint64(comm_data, *((uint64_t*)&comm), (void*)&data); + if( OPAL_SUCCESS == ret ) { + *value_size = data->a2a_count; + } + return ret; +} + +int mca_common_monitoring_coll_get_a2a_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle) +{ + ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; + size_t *value_size = (size_t*) value; + mca_monitoring_coll_data_t*data; + int ret = opal_hash_table_get_value_uint64(comm_data, *((uint64_t*)&comm), (void*)&data); + if( OPAL_SUCCESS == ret ) { + *value_size = data->a2a_size; + } + return ret; +} + +static void mca_monitoring_coll_construct (mca_monitoring_coll_data_t*coll_data) +{ + coll_data->procs = NULL; + coll_data->comm_name = NULL; + coll_data->world_rank = -1; + coll_data->p_comm = NULL; + coll_data->is_released = 0; + coll_data->o2a_count = 0; + coll_data->o2a_size = 0; + coll_data->a2o_count = 0; + coll_data->a2o_size = 0; + coll_data->a2a_count = 0; + coll_data->a2a_size = 0; +} + +static void mca_monitoring_coll_destruct (mca_monitoring_coll_data_t*coll_data){} + +OBJ_CLASS_INSTANCE(mca_monitoring_coll_data_t, opal_object_t, mca_monitoring_coll_construct, mca_monitoring_coll_destruct); diff --git 
a/ompi/mca/common/monitoring/common_monitoring_coll.h b/ompi/mca/common/monitoring/common_monitoring_coll.h new file mode 100644 index 00000000000..3deb4d0ad4f --- /dev/null +++ b/ompi/mca/common/monitoring/common_monitoring_coll.h @@ -0,0 +1,59 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * Copyright (c) 2017 The University of Tennessee and The University + * of Tennessee Research Foundation. All rights + * reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_COMMON_MONITORING_COLL_H +#define MCA_COMMON_MONITORING_COLL_H + +BEGIN_C_DECLS + +#include +#include + +OMPI_DECLSPEC void mca_common_monitoring_coll_flush(FILE *pf, mca_monitoring_coll_data_t*data); + +OMPI_DECLSPEC void mca_common_monitoring_coll_flush_all(FILE *pf); + +OMPI_DECLSPEC void mca_common_monitoring_coll_reset( void ); + +OMPI_DECLSPEC int mca_common_monitoring_coll_messages_notify(mca_base_pvar_t *pvar, + mca_base_pvar_event_t event, + void *obj_handle, + int *count); + +OMPI_DECLSPEC int mca_common_monitoring_coll_get_o2a_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle); + +OMPI_DECLSPEC int mca_common_monitoring_coll_get_o2a_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle); + +OMPI_DECLSPEC int mca_common_monitoring_coll_get_a2o_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle); + +OMPI_DECLSPEC int mca_common_monitoring_coll_get_a2o_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle); + +OMPI_DECLSPEC int mca_common_monitoring_coll_get_a2a_count(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle); + +OMPI_DECLSPEC int mca_common_monitoring_coll_get_a2a_size(const struct mca_base_pvar_t *pvar, + void *value, + void *obj_handle); + +OMPI_DECLSPEC void mca_common_monitoring_coll_finalize( void ); +END_C_DECLS + +#endif /* MCA_COMMON_MONITORING_COLL_H */ diff --git a/ompi/mca/osc/monitoring/Makefile.am b/ompi/mca/osc/monitoring/Makefile.am new file mode 100644 index 00000000000..7288793990e --- /dev/null +++ b/ompi/mca/osc/monitoring/Makefile.am @@ -0,0 +1,38 @@ +# +# Copyright (c) 2016 Inria. All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +monitoring_sources = \ + osc_monitoring.h \ + osc_monitoring_comm.h \ + osc_monitoring_component.c \ + osc_monitoring_accumulate.h \ + osc_monitoring_passive_target.h \ + osc_monitoring_active_target.h \ + osc_monitoring_dynamic.h \ + osc_monitoring_module.h \ + osc_monitoring_template.h + +if MCA_BUILD_ompi_osc_monitoring_DSO +component_noinst = +component_install = mca_osc_monitoring.la +else +component_noinst = libmca_osc_monitoring.la +component_install = +endif + +mcacomponentdir = $(ompilibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_osc_monitoring_la_SOURCES = $(monitoring_sources) +mca_osc_monitoring_la_LDFLAGS = -module -avoid-version +mca_osc_monitoring_la_LIBADD = \ + $(OMPI_TOP_BUILDDIR)/ompi/mca/common/monitoring/libmca_common_monitoring.la + +noinst_LTLIBRARIES = $(component_noinst) +libmca_osc_monitoring_la_SOURCES = $(monitoring_sources) +libmca_osc_monitoring_la_LDFLAGS = -module -avoid-version diff --git a/ompi/mca/osc/monitoring/configure.m4 b/ompi/mca/osc/monitoring/configure.m4 new file mode 100644 index 00000000000..24b8bfbd87e --- /dev/null +++ b/ompi/mca/osc/monitoring/configure.m4 @@ -0,0 +1,19 @@ +# -*- shell-script -*- +# +# Copyright (c) 2016 Inria. All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +# MCA_ompi_osc_monitoring_CONFIG() +# ------------------------------------------------ +AC_DEFUN([MCA_ompi_osc_monitoring_CONFIG],[ + AC_CONFIG_FILES([ompi/mca/osc/monitoring/Makefile]) + + OPAL_CHECK_PORTALS4([osc_monitoring], + [AC_DEFINE([OMPI_WITH_OSC_PORTALS4], [1], [Whether or not to generate template for osc_portals4])], + []) + ])dnl diff --git a/ompi/mca/osc/monitoring/osc_monitoring.h b/ompi/mca/osc/monitoring/osc_monitoring.h new file mode 100644 index 00000000000..8a223e459e4 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring.h @@ -0,0 +1,29 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_H +#define MCA_OSC_MONITORING_H + +BEGIN_C_DECLS + +#include +#include +#include + +struct ompi_osc_monitoring_component_t { + ompi_osc_base_component_t super; + int priority; +}; +typedef struct ompi_osc_monitoring_component_t ompi_osc_monitoring_component_t; + +OMPI_DECLSPEC extern ompi_osc_monitoring_component_t mca_osc_monitoring_component; + +END_C_DECLS + +#endif /* MCA_OSC_MONITORING_H */ diff --git a/ompi/mca/osc/monitoring/osc_monitoring_accumulate.h b/ompi/mca/osc/monitoring/osc_monitoring_accumulate.h new file mode 100644 index 00000000000..543740146c7 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_accumulate.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_ACCUMULATE_H +#define MCA_OSC_MONITORING_ACCUMULATE_H + +#include +#include +#include + +#define OSC_MONITORING_GENERATE_TEMPLATE_ACCUMULATE(template) \ + \ + static int ompi_osc_monitoring_## template ##_compare_and_swap (const void *origin_addr, \ + const void *compare_addr, \ + void *result_addr, \ + ompi_datatype_t *dt, \ + int target_rank, \ + ptrdiff_t target_disp, \ + ompi_win_t *win) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size; \ + ompi_datatype_type_size(dt, &type_size); \ + mca_common_monitoring_record_osc(world_rank, type_size, SEND); \ + mca_common_monitoring_record_osc(world_rank, type_size, RECV); \ + OPAL_MONITORING_PRINT_INFO("MPI_Compare_and_swap to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_compare_and_swap(origin_addr, compare_addr, result_addr, dt, target_rank, target_disp, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_get_accumulate (const void *origin_addr, \ + int origin_count, \ + ompi_datatype_t*origin_datatype, \ + void *result_addr, \ + int result_count, \ + ompi_datatype_t*result_datatype, \ + int target_rank, \ + MPI_Aint target_disp, \ + int target_count, \ + ompi_datatype_t*target_datatype, \ + ompi_op_t *op, ompi_win_t*win) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + 
ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, SEND); \ + ompi_datatype_type_size(result_datatype, &type_size); \ + data_size = result_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, RECV); \ + OPAL_MONITORING_PRINT_INFO("MPI_Get_accumulate to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_get_accumulate(origin_addr, origin_count, origin_datatype, result_addr, result_count, result_datatype, target_rank, target_disp, target_count, target_datatype, op, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_rget_accumulate (const void *origin_addr, \ + int origin_count, \ + ompi_datatype_t *origin_datatype, \ + void *result_addr, \ + int result_count, \ + ompi_datatype_t *result_datatype, \ + int target_rank, \ + MPI_Aint target_disp, \ + int target_count, \ + ompi_datatype_t*target_datatype, \ + ompi_op_t *op, \ + ompi_win_t *win, \ + ompi_request_t **request) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, SEND); \ + ompi_datatype_type_size(result_datatype, &type_size); \ + data_size = result_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, RECV); \ + OPAL_MONITORING_PRINT_INFO("MPI_Rget_accumulate to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_rget_accumulate(origin_addr, origin_count, origin_datatype, result_addr, result_count, result_datatype, target_rank, target_disp, target_count, target_datatype, op, win, request); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_raccumulate (const void *origin_addr, \ + int origin_count, \ + ompi_datatype_t *origin_datatype, \ + int target_rank, \ + ptrdiff_t target_disp, \ + int target_count, \ + ompi_datatype_t *target_datatype, \ + ompi_op_t *op, ompi_win_t *win, \ + ompi_request_t **request) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, SEND); \ + OPAL_MONITORING_PRINT_INFO("MPI_Raccumulate to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_raccumulate(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, op, win, request); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_accumulate (const void *origin_addr, \ + int origin_count, \ + ompi_datatype_t *origin_datatype, \ + int target_rank, \ + ptrdiff_t target_disp, \ + int target_count, \ + ompi_datatype_t *target_datatype, \ + ompi_op_t *op, ompi_win_t *win) \ + { \ + int world_rank; \ + /** \ 
+ * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, SEND); \ + OPAL_MONITORING_PRINT_INFO("MPI_Accumulate to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_accumulate(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, op, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_fetch_and_op (const void *origin_addr, \ + void *result_addr, \ + ompi_datatype_t *dt, \ + int target_rank, \ + ptrdiff_t target_disp, \ + ompi_op_t *op, ompi_win_t *win) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size; \ + ompi_datatype_type_size(dt, &type_size); \ + mca_common_monitoring_record_osc(world_rank, type_size, SEND); \ + mca_common_monitoring_record_osc(world_rank, type_size, RECV); \ + OPAL_MONITORING_PRINT_INFO("MPI_Fetch_and_op to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_fetch_and_op(origin_addr, result_addr, dt, target_rank, target_disp, op, win); \ + } + +#endif /* MCA_OSC_MONITORING_ACCUMULATE_H */ diff --git a/ompi/mca/osc/monitoring/osc_monitoring_active_target.h b/ompi/mca/osc/monitoring/osc_monitoring_active_target.h new file mode 100644 index 00000000000..3420bf60dc6 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_active_target.h @@ -0,0 +1,48 @@ +/* + * Copyright (c) 2016 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_ACTIVE_TARGET_H +#define MCA_OSC_MONITORING_ACTIVE_TARGET_H + +#include +#include + +#define OSC_MONITORING_GENERATE_TEMPLATE_ACTIVE_TARGET(template) \ + \ + static int ompi_osc_monitoring_## template ##_post (ompi_group_t *group, int assert, ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_post(group, assert, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_start (ompi_group_t *group, int assert, ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_start(group, assert, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_complete (ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_complete(win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_wait (ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_wait(win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_test (ompi_win_t *win, int *flag) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_test(win, flag); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_fence (int assert, ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_fence(assert, win); \ + } + +#endif /* MCA_OSC_MONITORING_ACTIVE_TARGET_H */ diff --git a/ompi/mca/osc/monitoring/osc_monitoring_comm.h b/ompi/mca/osc/monitoring/osc_monitoring_comm.h new file mode 100644 index 00000000000..173a821427f --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_comm.h @@ -0,0 +1,118 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_COMM_H +#define MCA_OSC_MONITORING_COMM_H + +#include +#include +#include + +#define OSC_MONITORING_GENERATE_TEMPLATE_COMM(template) \ + \ + static int ompi_osc_monitoring_## template ##_put (const void *origin_addr, \ + int origin_count, \ + ompi_datatype_t *origin_datatype, \ + int target_rank, \ + ptrdiff_t target_disp, \ + int target_count, \ + ompi_datatype_t *target_datatype, \ + ompi_win_t *win) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, SEND); \ + OPAL_MONITORING_PRINT_INFO("MPI_Put to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_put(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_rput (const void *origin_addr, \ + int origin_count, \ + ompi_datatype_t *origin_datatype, \ + int target_rank, \ + ptrdiff_t target_disp, \ + int target_count, \ + ompi_datatype_t *target_datatype, \ + ompi_win_t *win, \ + ompi_request_t **request) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == 
mca_common_monitoring_get_world_rank(target_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, data_size, SEND); \ + OPAL_MONITORING_PRINT_INFO("MPI_Rput to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_rput(origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win, request); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_get (void *origin_addr, int origin_count, \ + ompi_datatype_t *origin_datatype, \ + int source_rank, \ + ptrdiff_t source_disp, \ + int source_count, \ + ompi_datatype_t *source_datatype, \ + ompi_win_t *win) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(source_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, 0, SEND); \ + mca_common_monitoring_record_osc(world_rank, data_size, RECV); \ + OPAL_MONITORING_PRINT_INFO("MPI_Get to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_get(origin_addr, origin_count, origin_datatype, source_rank, source_disp, source_count, source_datatype, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_rget (void *origin_addr, int origin_count, \ + ompi_datatype_t *origin_datatype, \ + int source_rank, \ + ptrdiff_t source_disp, \ + int source_count, \ + ompi_datatype_t *source_datatype, \ + ompi_win_t *win, \ + ompi_request_t **request) \ + { \ + int world_rank; \ + /** \ + * If this fails the destination is not part of my MPI_COM_WORLD \ + * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank \ + */ \ + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(source_rank, ompi_osc_monitoring_## template ##_get_comm(win), &world_rank)) { \ + size_t type_size, data_size; \ + ompi_datatype_type_size(origin_datatype, &type_size); \ + data_size = origin_count*type_size; \ + mca_common_monitoring_record_osc(world_rank, 0, SEND); \ + mca_common_monitoring_record_osc(world_rank, data_size, RECV); \ + OPAL_MONITORING_PRINT_INFO("MPI_Rget to %d intercepted", world_rank); \ + } \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_rget(origin_addr, origin_count, origin_datatype, source_rank, source_disp, source_count, source_datatype, win, request); \ + } + +#endif /* MCA_OSC_MONITORING_COMM_H */ + diff --git a/ompi/mca/osc/monitoring/osc_monitoring_component.c b/ompi/mca/osc/monitoring/osc_monitoring_component.c new file mode 100644 index 00000000000..1641b93bb92 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_component.c @@ -0,0 +1,154 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include "osc_monitoring.h" +#include +#include +#include +#include +#include +#include +#include +#include + +/***************************************/ +/* Include template generating macros */ +#include "osc_monitoring_template.h" + +#include +OSC_MONITORING_MODULE_TEMPLATE_GENERATE(rdma, ompi_osc_rdma_module_t, comm) +#undef GET_MODULE + +#include +OSC_MONITORING_MODULE_TEMPLATE_GENERATE(sm, ompi_osc_sm_module_t, comm) +#undef GET_MODULE + +#include +OSC_MONITORING_MODULE_TEMPLATE_GENERATE(pt2pt, ompi_osc_pt2pt_module_t, comm) +#undef GET_MODULE + +#ifdef OMPI_WITH_OSC_PORTALS4 +#include +OSC_MONITORING_MODULE_TEMPLATE_GENERATE(portals4, ompi_osc_portals4_module_t, comm) +#undef GET_MODULE +#endif /* OMPI_WITH_OSC_PORTALS4 */ + +/***************************************/ + +static int mca_osc_monitoring_component_init(bool enable_progress_threads, + bool enable_mpi_threads) +{ + OPAL_MONITORING_PRINT_INFO("osc_component_init"); + mca_common_monitoring_init(); + return OMPI_SUCCESS; +} + +static int mca_osc_monitoring_component_finish(void) +{ + OPAL_MONITORING_PRINT_INFO("osc_component_finish"); + mca_common_monitoring_finalize(); + return OMPI_SUCCESS; +} + +static int mca_osc_monitoring_component_register(void) +{ + return OMPI_SUCCESS; +} + +static int mca_osc_monitoring_component_query(struct ompi_win_t *win, void **base, size_t size, int disp_unit, + struct ompi_communicator_t *comm, struct opal_info_t *info, + int flavor) +{ + OPAL_MONITORING_PRINT_INFO("osc_component_query"); + return mca_osc_monitoring_component.priority; +} + +static int mca_osc_monitoring_component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit, + struct ompi_communicator_t *comm, struct opal_info_t *info, + int flavor, int *model) +{ + OPAL_MONITORING_PRINT_INFO("osc_component_select"); + opal_list_item_t *item; + ompi_osc_base_component_t *best_component = NULL; + int best_priority = -1, priority, ret = OMPI_SUCCESS; + + /* Redo the select loop to add our layer in the middle */ + for (item = opal_list_get_first(&ompi_osc_base_framework.framework_components) ; + item != opal_list_get_end(&ompi_osc_base_framework.framework_components) ; + item = opal_list_get_next(item)) { + ompi_osc_base_component_t *component = (ompi_osc_base_component_t*) + ((mca_base_component_list_item_t*) item)->cli_component; + + if( component == (ompi_osc_base_component_t*)(&mca_osc_monitoring_component) ) + continue; /* skip self */ + + priority = component->osc_query(win, base, size, disp_unit, comm, info, flavor); + if (priority < 0) { + if (MPI_WIN_FLAVOR_SHARED == flavor && OMPI_ERR_RMA_SHARED == priority) { + /* NTH: quick fix to return OMPI_ERR_RMA_SHARED */ + return OMPI_ERR_RMA_SHARED; + } + continue; + } + + if (priority > best_priority) { + best_component = component; + best_priority = priority; + } + } + + if (NULL == best_component) return OMPI_ERR_NOT_SUPPORTED; + OPAL_MONITORING_PRINT_INFO("osc: chosen one: %s", best_component->osc_version.mca_component_name); + ret = best_component->osc_select(win, base, size, disp_unit, comm, info, flavor, model); + if( OMPI_SUCCESS == ret ) { + /* Intercept module functions with ours, based on selected component */ + if( 0 == strcmp("rdma", best_component->osc_version.mca_component_name) ) { + OSC_MONITORING_SET_TEMPLATE(rdma, win->w_osc_module); + } else if( 0 == strcmp("sm", best_component->osc_version.mca_component_name) ) { + OSC_MONITORING_SET_TEMPLATE(sm, 
win->w_osc_module); + } else if( 0 == strcmp("pt2pt", best_component->osc_version.mca_component_name) ) { + OSC_MONITORING_SET_TEMPLATE(pt2pt, win->w_osc_module); +#ifdef OMPI_WITH_OSC_PORTALS4 + } else if( 0 == strcmp("portals4", best_component->osc_version.mca_component_name) ) { + OSC_MONITORING_SET_TEMPLATE(portals4, win->w_osc_module); +#endif /* OMPI_WITH_OSC_PORTALS4 */ + } else { + OPAL_MONITORING_PRINT_WARN("osc: monitoring disabled: no module for this component " + "(%s)", best_component->osc_version.mca_component_name); + return ret; + } + } + return ret; +} + +ompi_osc_monitoring_component_t mca_osc_monitoring_component = { + .super = { + /* First, the mca_base_component_t struct containing meta + information about the component itself */ + .osc_version = { + OMPI_OSC_BASE_VERSION_3_0_0, + + .mca_component_name = "monitoring", /* MCA component name */ + MCA_MONITORING_MAKE_VERSION, + .mca_register_component_params = mca_osc_monitoring_component_register + }, + .osc_data = { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + .osc_init = mca_osc_monitoring_component_init, /* component init */ + .osc_finalize = mca_osc_monitoring_component_finish, /* component finalize */ + .osc_query = mca_osc_monitoring_component_query, + .osc_select = mca_osc_monitoring_component_select + }, + .priority = INT_MAX +}; + diff --git a/ompi/mca/osc/monitoring/osc_monitoring_dynamic.h b/ompi/mca/osc/monitoring/osc_monitoring_dynamic.h new file mode 100644 index 00000000000..5a8101ea200 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_dynamic.h @@ -0,0 +1,27 @@ +/* + * Copyright (c) 2016 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_DYNAMIC_H +#define MCA_OSC_MONITORING_DYNAMIC_H + +#include + +#define OSC_MONITORING_GENERATE_TEMPLATE_DYNAMIC(template) \ + \ + static int ompi_osc_monitoring_## template ##_attach (struct ompi_win_t *win, void *base, size_t len) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_win_attach(win, base, len); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_detach (struct ompi_win_t *win, const void *base) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_win_detach(win, base); \ + } + +#endif /* MCA_OSC_MONITORING_DYNAMIC_H */ diff --git a/ompi/mca/osc/monitoring/osc_monitoring_module.h b/ompi/mca/osc/monitoring/osc_monitoring_module.h new file mode 100644 index 00000000000..88eb2248d64 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_module.h @@ -0,0 +1,89 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * Copyright (c) 2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_MODULE_H +#define MCA_OSC_MONITORING_MODULE_H + +#include +#include +#include + +/* Define once and for all the module_template variable name */ +#define OMPI_OSC_MONITORING_MODULE_VARIABLE(template) \ + ompi_osc_monitoring_module_## template ##_template + +/* Define once and for all the module_template variable name */ +#define OMPI_OSC_MONITORING_MODULE_INIT(template) \ + ompi_osc_monitoring_module_## template ##_init_done + +/* Define once and for all the template variable name */ +#define OMPI_OSC_MONITORING_TEMPLATE_VARIABLE(template) \ + ompi_osc_monitoring_## template ##_template + +/* Define the ompi_osc_monitoring_module_## template ##_template variable */ +#define OMPI_OSC_MONITORING_MODULE_GENERATE(template) \ + static ompi_osc_base_module_t OMPI_OSC_MONITORING_MODULE_VARIABLE(template) + +/* Define the ompi_osc_monitoring_module_## template ##_init_done variable */ +#define OMPI_OSC_MONITORING_MODULE_INIT_GENERATE(template) \ + static int64_t OMPI_OSC_MONITORING_MODULE_INIT(template) + +/* Define and set the ompi_osc_monitoring_## template ##_template + * variable. The functions recorded here are linked to the original + * functions of the original {template} module that were replaced. + */ +#define MCA_OSC_MONITORING_MODULE_TEMPLATE_GENERATE(template) \ + static ompi_osc_base_module_t OMPI_OSC_MONITORING_TEMPLATE_VARIABLE(template) = { \ + .osc_win_attach = ompi_osc_monitoring_## template ##_attach, \ + .osc_win_detach = ompi_osc_monitoring_## template ##_detach, \ + .osc_free = ompi_osc_monitoring_## template ##_free, \ + \ + .osc_put = ompi_osc_monitoring_## template ##_put, \ + .osc_get = ompi_osc_monitoring_## template ##_get, \ + .osc_accumulate = ompi_osc_monitoring_## template ##_accumulate, \ + .osc_compare_and_swap = ompi_osc_monitoring_## template ##_compare_and_swap, \ + .osc_fetch_and_op = ompi_osc_monitoring_## template ##_fetch_and_op, \ + .osc_get_accumulate = ompi_osc_monitoring_## template ##_get_accumulate, \ + \ + .osc_rput = ompi_osc_monitoring_## template ##_rput, \ + .osc_rget = ompi_osc_monitoring_## template ##_rget, \ + .osc_raccumulate = ompi_osc_monitoring_## template ##_raccumulate, \ + .osc_rget_accumulate = ompi_osc_monitoring_## template ##_rget_accumulate, \ + \ + .osc_fence = ompi_osc_monitoring_## template ##_fence, \ + \ + .osc_start = ompi_osc_monitoring_## template ##_start, \ + .osc_complete = ompi_osc_monitoring_## template ##_complete, \ + .osc_post = ompi_osc_monitoring_## template ##_post, \ + .osc_wait = ompi_osc_monitoring_## template ##_wait, \ + .osc_test = ompi_osc_monitoring_## template ##_test, \ + \ + .osc_lock = ompi_osc_monitoring_## template ##_lock, \ + .osc_unlock = ompi_osc_monitoring_## template ##_unlock, \ + .osc_lock_all = ompi_osc_monitoring_## template ##_lock_all, \ + .osc_unlock_all = ompi_osc_monitoring_## template ##_unlock_all, \ + \ + .osc_sync = ompi_osc_monitoring_## template ##_sync, \ + .osc_flush = ompi_osc_monitoring_## template ##_flush, \ + .osc_flush_all = ompi_osc_monitoring_## template ##_flush_all, \ + .osc_flush_local = ompi_osc_monitoring_## template ##_flush_local, \ + .osc_flush_local_all = ompi_osc_monitoring_## template ##_flush_local_all, \ + } + +#define OSC_MONITORING_GENERATE_TEMPLATE_MODULE(template) \ + \ + static int ompi_osc_monitoring_## template ##_free(ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_free(win); \ + } + +#endif /* 
MCA_OSC_MONITORING_MODULE_H */ + diff --git a/ompi/mca/osc/monitoring/osc_monitoring_passive_target.h b/ompi/mca/osc/monitoring/osc_monitoring_passive_target.h new file mode 100644 index 00000000000..9e91b3f6e76 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_passive_target.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2016 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_PASSIVE_TARGET_H +#define MCA_OSC_MONITORING_PASSIVE_TARGET_H + +#include + +#define OSC_MONITORING_GENERATE_TEMPLATE_PASSIVE_TARGET(template) \ + \ + static int ompi_osc_monitoring_## template ##_sync (struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_sync(win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_flush (int target, struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_flush(target, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_flush_all (struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_flush_all(win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_flush_local (int target, struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_flush_local(target, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_flush_local_all (struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_flush_local_all(win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_lock (int lock_type, int target, int assert, ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_lock(lock_type, target, assert, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_unlock (int target, ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_unlock(target, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_lock_all (int assert, struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_lock_all(assert, win); \ + } \ + \ + static int ompi_osc_monitoring_## template ##_unlock_all (struct ompi_win_t *win) \ + { \ + return OMPI_OSC_MONITORING_MODULE_VARIABLE(template).osc_unlock_all(win); \ + } + +#endif /* MCA_OSC_MONITORING_PASSIVE_TARGET_H */ + diff --git a/ompi/mca/osc/monitoring/osc_monitoring_template.h b/ompi/mca/osc/monitoring/osc_monitoring_template.h new file mode 100644 index 00000000000..85475733f98 --- /dev/null +++ b/ompi/mca/osc/monitoring/osc_monitoring_template.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#ifndef MCA_OSC_MONITORING_TEMPLATE_H +#define MCA_OSC_MONITORING_TEMPLATE_H + +#include +#include +#include +#include +#include "osc_monitoring_accumulate.h" +#include "osc_monitoring_active_target.h" +#include "osc_monitoring_comm.h" +#include "osc_monitoring_dynamic.h" +#include "osc_monitoring_module.h" +#include "osc_monitoring_passive_target.h" + +/* module_type correspond to the ompi_osc_## template ##_module_t type + * comm correspond to the comm field name in ompi_osc_## template ##_module_t + * + * The magic used here is that for a given module type (given with the + * {template} parameter), we generate a set of every functions defined + * in ompi_osc_base_module_t, the ompi_osc_monitoring_module_## + * template ##_template variable recording the original set of + * functions, and the ompi_osc_monitoring_## template ##_template + * variable that record the generated set of functions. When a + * function is called from the original module, we route the call to + * our generated function that does the monitoring, and then we call + * the original function that had been saved in the + * ompi_osc_monitoring_module_## template ##_template variable. + */ +#define OSC_MONITORING_MODULE_TEMPLATE_GENERATE(template, module_type, comm) \ + /* Generate the proper symbol for the \ + ompi_osc_monitoring_module_## template ##_template variable */ \ + OMPI_OSC_MONITORING_MODULE_GENERATE(template); \ + OMPI_OSC_MONITORING_MODULE_INIT_GENERATE(template); \ + /* Generate module specific module->comm accessor */ \ + static inline struct ompi_communicator_t* \ + ompi_osc_monitoring_## template ##_get_comm(ompi_win_t*win) \ + { \ + return ((module_type*)win->w_osc_module)->comm; \ + } \ + /* Generate each module specific functions */ \ + OSC_MONITORING_GENERATE_TEMPLATE_ACCUMULATE(template) \ + OSC_MONITORING_GENERATE_TEMPLATE_ACTIVE_TARGET(template) \ + OSC_MONITORING_GENERATE_TEMPLATE_COMM(template) \ + OSC_MONITORING_GENERATE_TEMPLATE_DYNAMIC(template) \ + OSC_MONITORING_GENERATE_TEMPLATE_MODULE(template) \ + OSC_MONITORING_GENERATE_TEMPLATE_PASSIVE_TARGET(template) \ + /* Set the mca_osc_monitoring_## template ##_template variable */ \ + MCA_OSC_MONITORING_MODULE_TEMPLATE_GENERATE(template); \ + /* Generate template specific module initialization function */ \ + static inline void* \ + ompi_osc_monitoring_## template ##_set_template (ompi_osc_base_module_t*module) \ + { \ + if( 1 == opal_atomic_add_64(&(OMPI_OSC_MONITORING_MODULE_INIT(template)), 1) ) { \ + /* Saves the original module functions in \ + * ompi_osc_monitoring_module_## template ##_template \ + */ \ + memcpy(&OMPI_OSC_MONITORING_MODULE_VARIABLE(template), \ + module, sizeof(ompi_osc_base_module_t)); \ + } \ + /* Replace the original functions with our generated ones */ \ + memcpy(module, &OMPI_OSC_MONITORING_TEMPLATE_VARIABLE(template), \ + sizeof(ompi_osc_base_module_t)); \ + return module; \ + } + +#define OSC_MONITORING_SET_TEMPLATE(template, module) \ + ompi_osc_monitoring_## template ##_set_template(module) + +#endif /* MCA_OSC_MONITORING_TEMPLATE_H */ + diff --git a/ompi/mca/pml/monitoring/Makefile.am b/ompi/mca/pml/monitoring/Makefile.am index 517af90c0fd..3af691b0ee6 100644 --- a/ompi/mca/pml/monitoring/Makefile.am +++ b/ompi/mca/pml/monitoring/Makefile.am @@ -11,7 +11,6 @@ # monitoring_sources = \ - pml_monitoring.c \ pml_monitoring.h \ pml_monitoring_comm.c \ pml_monitoring_component.c \ @@ -32,6 +31,8 @@ mcacomponentdir = $(ompilibdir) 
mcacomponent_LTLIBRARIES = $(component_install) mca_pml_monitoring_la_SOURCES = $(monitoring_sources) mca_pml_monitoring_la_LDFLAGS = -module -avoid-version +mca_pml_monitoring_la_LIBADD = \ + $(OMPI_TOP_BUILDDIR)/ompi/mca/common/monitoring/libmca_common_monitoring.la noinst_LTLIBRARIES = $(component_noinst) libmca_pml_monitoring_la_SOURCES = $(monitoring_sources) diff --git a/ompi/mca/pml/monitoring/pml_monitoring.c b/ompi/mca/pml/monitoring/pml_monitoring.c deleted file mode 100644 index 5fc7bee32a0..00000000000 --- a/ompi/mca/pml/monitoring/pml_monitoring.c +++ /dev/null @@ -1,258 +0,0 @@ -/* - * Copyright (c) 2013-2016 The University of Tennessee and The University - * of Tennessee Research Foundation. All rights - * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. - * Copyright (c) 2015 Bull SAS. All rights reserved. - * Copyright (c) 2016 Research Organization for Information Science - * and Technology (RIST). All rights reserved. - * $COPYRIGHT$ - * - * Additional copyrights may follow - * - * $HEADER$ - */ - -#include -#include -#include "opal/class/opal_hash_table.h" - -/* array for stroring monitoring data*/ -uint64_t* sent_data = NULL; -uint64_t* messages_count = NULL; -uint64_t* filtered_sent_data = NULL; -uint64_t* filtered_messages_count = NULL; - -static int init_done = 0; -static int nbprocs = -1; -static int my_rank = -1; -opal_hash_table_t *translation_ht = NULL; - - -mca_pml_monitoring_module_t mca_pml_monitoring = { - mca_pml_monitoring_add_procs, - mca_pml_monitoring_del_procs, - mca_pml_monitoring_enable, - NULL, - mca_pml_monitoring_add_comm, - mca_pml_monitoring_del_comm, - mca_pml_monitoring_irecv_init, - mca_pml_monitoring_irecv, - mca_pml_monitoring_recv, - mca_pml_monitoring_isend_init, - mca_pml_monitoring_isend, - mca_pml_monitoring_send, - mca_pml_monitoring_iprobe, - mca_pml_monitoring_probe, - mca_pml_monitoring_start, - mca_pml_monitoring_improbe, - mca_pml_monitoring_mprobe, - mca_pml_monitoring_imrecv, - mca_pml_monitoring_mrecv, - mca_pml_monitoring_dump, - NULL, - 65535, - INT_MAX -}; - -/** - * This PML monitors only the processes in the MPI_COMM_WORLD. As OMPI is now lazily - * adding peers on the first call to add_procs we need to check how many processes - * are in the MPI_COMM_WORLD to create the storage with the right size. 
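
A minimal sketch of the rank-translation scheme this comment describes, assuming only the OPAL hash-table API that the removed code below already uses (OBJ_NEW, opal_hash_table_init, opal_hash_table_set_value_uint64, opal_hash_table_get_value_uint64); the two helper names are hypothetical stand-ins for the logic that mca_common_monitoring_get_world_rank now centralizes in ompi/mca/common/monitoring:

/* Illustrative sketch only, not part of the patch. Further includes
 * (opal/util/proc.h for opal_process_name_t, opal/constants.h for the
 * return codes) are omitted for brevity. */
#include "opal/class/opal_hash_table.h"

static opal_hash_table_t *name_to_world_rank = NULL;

/* Remember that the peer identified by 'name' is rank 'world_rank' in MPI_COMM_WORLD. */
static void sketch_record_peer(opal_process_name_t name, int world_rank)
{
    uint64_t key = *((uint64_t*)&name);   /* the 64-bit opal process name is the key */
    if( NULL == name_to_world_rank ) {
        name_to_world_rank = OBJ_NEW(opal_hash_table_t);
        opal_hash_table_init(name_to_world_rank, 2048);
    }
    opal_hash_table_set_value_uint64(name_to_world_rank, key,
                                     (void*)(uintptr_t)world_rank);
}

/* Translate a peer's process name back to its MPI_COMM_WORLD rank.
 * Fails if the peer is not part of this MPI_COMM_WORLD. */
static int sketch_translate_to_world_rank(opal_process_name_t name, int *world_rank)
{
    void *value;
    uint64_t key = *((uint64_t*)&name);
    if( OPAL_SUCCESS != opal_hash_table_get_value_uint64(name_to_world_rank, key, &value) )
        return OPAL_ERR_NOT_FOUND;
    *world_rank = (int)(uintptr_t)value;
    return OPAL_SUCCESS;
}

Keying the table on the opal process name rather than on a communicator-local rank is what lets every interception point fold its peer back to a single MPI_COMM_WORLD index, regardless of which communicator the operation was issued on.
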
- */ -int mca_pml_monitoring_add_procs(struct ompi_proc_t **procs, - size_t nprocs) -{ - opal_process_name_t tmp, wp_name; - size_t i, peer_rank, nprocs_world; - uint64_t key; - - if(NULL == translation_ht) { - translation_ht = OBJ_NEW(opal_hash_table_t); - opal_hash_table_init(translation_ht, 2048); - /* get my rank in the MPI_COMM_WORLD */ - my_rank = ompi_comm_rank((ompi_communicator_t*)&ompi_mpi_comm_world); - } - - nprocs_world = ompi_comm_size((ompi_communicator_t*)&ompi_mpi_comm_world); - /* For all procs in the same MPI_COMM_WORLD we need to add them to the hash table */ - for( i = 0; i < nprocs; i++ ) { - - /* Extract the peer procname from the procs array */ - if( ompi_proc_is_sentinel(procs[i]) ) { - tmp = ompi_proc_sentinel_to_name((uintptr_t)procs[i]); - } else { - tmp = procs[i]->super.proc_name; - } - if( tmp.jobid != ompi_proc_local_proc->super.proc_name.jobid ) - continue; - - for( peer_rank = 0; peer_rank < nprocs_world; peer_rank++ ) { - wp_name = ompi_group_get_proc_name(((ompi_communicator_t*)&ompi_mpi_comm_world)->c_remote_group, peer_rank); - if( 0 != opal_compare_proc( tmp, wp_name) ) - continue; - - /* Find the rank of the peer in MPI_COMM_WORLD */ - key = *((uint64_t*)&tmp); - /* store the rank (in COMM_WORLD) of the process - with its name (a uniq opal ID) as key in the hash table*/ - if( OPAL_SUCCESS != opal_hash_table_set_value_uint64(translation_ht, - key, (void*)(uintptr_t)peer_rank) ) { - return OMPI_ERR_OUT_OF_RESOURCE; /* failed to allocate memory or growing the hash table */ - } - break; - } - } - return pml_selected_module.pml_add_procs(procs, nprocs); -} - -/** - * Pass the information down the PML stack. - */ -int mca_pml_monitoring_del_procs(struct ompi_proc_t **procs, - size_t nprocs) -{ - return pml_selected_module.pml_del_procs(procs, nprocs); -} - -int mca_pml_monitoring_dump(struct ompi_communicator_t* comm, - int verbose) -{ - return pml_selected_module.pml_dump(comm, verbose); -} - - -void finalize_monitoring( void ) -{ - free(filtered_sent_data); - free(filtered_messages_count); - free(sent_data); - free(messages_count); - opal_hash_table_remove_all( translation_ht ); - free(translation_ht); -} - -/** - * We have delayed the initialization until the first send so that we know that - * the MPI_COMM_WORLD (which is the only communicator we are interested on at - * this point) is correctly initialized. 
- */ -static void initialize_monitoring( void ) -{ - nbprocs = ompi_comm_size((ompi_communicator_t*)&ompi_mpi_comm_world); - sent_data = (uint64_t*)calloc(nbprocs, sizeof(uint64_t)); - messages_count = (uint64_t*)calloc(nbprocs, sizeof(uint64_t)); - filtered_sent_data = (uint64_t*)calloc(nbprocs, sizeof(uint64_t)); - filtered_messages_count = (uint64_t*)calloc(nbprocs, sizeof(uint64_t)); - - init_done = 1; -} - -void mca_pml_monitoring_reset( void ) -{ - if( !init_done ) return; - memset(sent_data, 0, nbprocs * sizeof(uint64_t)); - memset(messages_count, 0, nbprocs * sizeof(uint64_t)); - memset(filtered_sent_data, 0, nbprocs * sizeof(uint64_t)); - memset(filtered_messages_count, 0, nbprocs * sizeof(uint64_t)); -} - -void monitor_send_data(int world_rank, size_t data_size, int tag) -{ - if( 0 == filter_monitoring() ) return; /* right now the monitoring is not started */ - - if ( !init_done ) - initialize_monitoring(); - - /* distinguishses positive and negative tags if requested */ - if( (tag < 0) && (1 == filter_monitoring()) ) { - filtered_sent_data[world_rank] += data_size; - filtered_messages_count[world_rank]++; - } else { /* if filtered monitoring is not activated data is aggregated indifferently */ - sent_data[world_rank] += data_size; - messages_count[world_rank]++; - } -} - -int mca_pml_monitoring_get_messages_count(const struct mca_base_pvar_t *pvar, - void *value, - void *obj_handle) -{ - ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; - int i, comm_size = ompi_comm_size (comm); - uint64_t *values = (uint64_t*) value; - - if(comm != &ompi_mpi_comm_world.comm || NULL == messages_count) - return OMPI_ERROR; - - for (i = 0 ; i < comm_size ; ++i) { - values[i] = messages_count[i]; - } - - return OMPI_SUCCESS; -} - -int mca_pml_monitoring_get_messages_size(const struct mca_base_pvar_t *pvar, - void *value, - void *obj_handle) -{ - ompi_communicator_t *comm = (ompi_communicator_t *) obj_handle; - int comm_size = ompi_comm_size (comm); - uint64_t *values = (uint64_t*) value; - int i; - - if(comm != &ompi_mpi_comm_world.comm || NULL == sent_data) - return OMPI_ERROR; - - for (i = 0 ; i < comm_size ; ++i) { - values[i] = sent_data[i]; - } - - return OMPI_SUCCESS; -} - -static void output_monitoring( FILE *pf ) -{ - if( 0 == filter_monitoring() ) return; /* if disabled do nothing */ - - for (int i = 0 ; i < nbprocs ; i++) { - if(sent_data[i] > 0) { - fprintf(pf, "I\t%d\t%d\t%" PRIu64 " bytes\t%" PRIu64 " msgs sent\n", - my_rank, i, sent_data[i], messages_count[i]); - } - } - - if( 1 == filter_monitoring() ) return; - - for (int i = 0 ; i < nbprocs ; i++) { - if(filtered_sent_data[i] > 0) { - fprintf(pf, "E\t%d\t%d\t%" PRIu64 " bytes\t%" PRIu64 " msgs sent\n", - my_rank, i, filtered_sent_data[i], filtered_messages_count[i]); - } - } -} - - -/* - Flushes the monitoring into filename - Useful for phases (see example in test/monitoring) -*/ -int ompi_mca_pml_monitoring_flush(char* filename) -{ - FILE *pf = stderr; - - if ( !init_done ) return -1; - - if( NULL != filename ) - pf = fopen(filename, "w"); - - if(!pf) - return -1; - - fprintf(stderr, "Proc %d flushing monitoring to: %s\n", my_rank, filename); - output_monitoring( pf ); - - if( NULL != filename ) - fclose(pf); - return 0; -} diff --git a/ompi/mca/pml/monitoring/pml_monitoring.h b/ompi/mca/pml/monitoring/pml_monitoring.h index efd9a5b0686..db9fe725476 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring.h +++ b/ompi/mca/pml/monitoring/pml_monitoring.h @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of 
Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * Copyright (c) 2015 Bull SAS. All rights reserved. * $COPYRIGHT$ * @@ -20,14 +20,15 @@ BEGIN_C_DECLS #include #include #include -#include +#include +#include #include typedef mca_pml_base_module_t mca_pml_monitoring_module_t; extern mca_pml_base_component_t pml_selected_component; extern mca_pml_base_module_t pml_selected_module; -extern mca_pml_monitoring_module_t mca_pml_monitoring; +extern mca_pml_monitoring_module_t mca_pml_monitoring_module; OMPI_DECLSPEC extern mca_pml_base_component_2_0_0_t mca_pml_monitoring_component; /* @@ -38,11 +39,9 @@ extern int mca_pml_monitoring_add_comm(struct ompi_communicator_t* comm); extern int mca_pml_monitoring_del_comm(struct ompi_communicator_t* comm); -extern int mca_pml_monitoring_add_procs(struct ompi_proc_t **procs, - size_t nprocs); +extern int mca_pml_monitoring_add_procs(struct ompi_proc_t **procs, size_t nprocs); -extern int mca_pml_monitoring_del_procs(struct ompi_proc_t **procs, - size_t nprocs); +extern int mca_pml_monitoring_del_procs(struct ompi_proc_t **procs, size_t nprocs); extern int mca_pml_monitoring_enable(bool enable); @@ -138,20 +137,6 @@ extern int mca_pml_monitoring_dump(struct ompi_communicator_t* comm, extern int mca_pml_monitoring_start(size_t count, ompi_request_t** requests); -int mca_pml_monitoring_get_messages_count (const struct mca_base_pvar_t *pvar, - void *value, - void *obj_handle); - -int mca_pml_monitoring_get_messages_size (const struct mca_base_pvar_t *pvar, - void *value, - void *obj_handle); - -void finalize_monitoring( void ); -int filter_monitoring( void ); -void mca_pml_monitoring_reset( void ); -int ompi_mca_pml_monitoring_flush(char* filename); -void monitor_send_data(int world_rank, size_t data_size, int tag); - END_C_DECLS #endif /* MCA_PML_MONITORING_H */ diff --git a/ompi/mca/pml/monitoring/pml_monitoring_comm.c b/ompi/mca/pml/monitoring/pml_monitoring_comm.c index 1200f7ad714..44b7d0c9d69 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring_comm.c +++ b/ompi/mca/pml/monitoring/pml_monitoring_comm.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -11,7 +11,7 @@ */ #include -#include +#include "pml_monitoring.h" int mca_pml_monitoring_add_comm(struct ompi_communicator_t* comm) { diff --git a/ompi/mca/pml/monitoring/pml_monitoring_component.c b/ompi/mca/pml/monitoring/pml_monitoring_component.c index 540d414dca0..7c8bc6c1dd5 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring_component.c +++ b/ompi/mca/pml/monitoring/pml_monitoring_component.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2016 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * Copyright (c) 2015 Bull SAS. All rights reserved. * Copyright (c) 2015 Research Organization for Information Science * and Technology (RIST). All rights reserved. 
@@ -14,123 +14,81 @@ */ #include -#include +#include "pml_monitoring.h" #include #include +#include #include -static int mca_pml_monitoring_enabled = 0; static int mca_pml_monitoring_active = 0; -static int mca_pml_monitoring_current_state = 0; -static char* mca_pml_monitoring_current_filename = NULL; + mca_pml_base_component_t pml_selected_component = {{0}}; mca_pml_base_module_t pml_selected_module = {0}; -/* Return the current status of the monitoring system 0 if off, 1 if the - * seperation between internal tags and external tags is enabled. Any other - * positive value if the segregation between point-to-point and collective is - * disabled. - */ -int filter_monitoring( void ) -{ - return mca_pml_monitoring_current_state; -} - -static int -mca_pml_monitoring_set_flush(struct mca_base_pvar_t *pvar, const void *value, void *obj) -{ - if( NULL != mca_pml_monitoring_current_filename ) - free(mca_pml_monitoring_current_filename); - if( NULL == value ) /* No more output */ - mca_pml_monitoring_current_filename = NULL; - else { - mca_pml_monitoring_current_filename = strdup((char*)value); - if( NULL == mca_pml_monitoring_current_filename ) - return OMPI_ERROR; - } - return OMPI_SUCCESS; -} +mca_pml_monitoring_module_t mca_pml_monitoring_module = { + mca_pml_monitoring_add_procs, + mca_pml_monitoring_del_procs, + mca_pml_monitoring_enable, + NULL, + mca_pml_monitoring_add_comm, + mca_pml_monitoring_del_comm, + mca_pml_monitoring_irecv_init, + mca_pml_monitoring_irecv, + mca_pml_monitoring_recv, + mca_pml_monitoring_isend_init, + mca_pml_monitoring_isend, + mca_pml_monitoring_send, + mca_pml_monitoring_iprobe, + mca_pml_monitoring_probe, + mca_pml_monitoring_start, + mca_pml_monitoring_improbe, + mca_pml_monitoring_mprobe, + mca_pml_monitoring_imrecv, + mca_pml_monitoring_mrecv, + mca_pml_monitoring_dump, + NULL, + 65535, + INT_MAX +}; -static int -mca_pml_monitoring_get_flush(const struct mca_base_pvar_t *pvar, void *value, void *obj) +/** + * This PML monitors only the processes in the MPI_COMM_WORLD. As OMPI is now lazily + * adding peers on the first call to add_procs we need to check how many processes + * are in the MPI_COMM_WORLD to create the storage with the right size. + */ +int mca_pml_monitoring_add_procs(struct ompi_proc_t **procs, + size_t nprocs) { - return OMPI_SUCCESS; + int ret = mca_common_monitoring_add_procs(procs, nprocs); + if( OMPI_SUCCESS == ret ) + ret = pml_selected_module.pml_add_procs(procs, nprocs); + return ret; } -static int -mca_pml_monitoring_notify_flush(struct mca_base_pvar_t *pvar, mca_base_pvar_event_t event, - void *obj, int *count) +/** + * Pass the information down the PML stack. + */ +int mca_pml_monitoring_del_procs(struct ompi_proc_t **procs, + size_t nprocs) { - switch (event) { - case MCA_BASE_PVAR_HANDLE_BIND: - mca_pml_monitoring_reset(); - *count = (NULL == mca_pml_monitoring_current_filename ? 
0 : strlen(mca_pml_monitoring_current_filename)); - case MCA_BASE_PVAR_HANDLE_UNBIND: - return OMPI_SUCCESS; - case MCA_BASE_PVAR_HANDLE_START: - mca_pml_monitoring_current_state = mca_pml_monitoring_enabled; - return OMPI_SUCCESS; - case MCA_BASE_PVAR_HANDLE_STOP: - if( 0 == ompi_mca_pml_monitoring_flush(mca_pml_monitoring_current_filename) ) - return OMPI_SUCCESS; - } - return OMPI_ERROR; + return pml_selected_module.pml_del_procs(procs, nprocs); } -static int -mca_pml_monitoring_messages_notify(mca_base_pvar_t *pvar, - mca_base_pvar_event_t event, - void *obj_handle, - int *count) +int mca_pml_monitoring_dump(struct ompi_communicator_t* comm, + int verbose) { - switch (event) { - case MCA_BASE_PVAR_HANDLE_BIND: - /* Return the size of the communicator as the number of values */ - *count = ompi_comm_size ((ompi_communicator_t *) obj_handle); - case MCA_BASE_PVAR_HANDLE_UNBIND: - return OMPI_SUCCESS; - case MCA_BASE_PVAR_HANDLE_START: - mca_pml_monitoring_current_state = mca_pml_monitoring_enabled; - return OMPI_SUCCESS; - case MCA_BASE_PVAR_HANDLE_STOP: - mca_pml_monitoring_current_state = 0; - return OMPI_SUCCESS; - } - - return OMPI_ERROR; + return pml_selected_module.pml_dump(comm, verbose); } int mca_pml_monitoring_enable(bool enable) { - /* If we reach this point we were succesful at hijacking the interface of - * the real PML, and we are now correctly interleaved between the upper - * layer and the real PML. - */ - (void)mca_base_pvar_register("ompi", "pml", "monitoring", "flush", "Flush the monitoring information" - "in the provided file", OPAL_INFO_LVL_1, MCA_BASE_PVAR_CLASS_GENERIC, - MCA_BASE_VAR_TYPE_STRING, NULL, MPI_T_BIND_NO_OBJECT, - 0, - mca_pml_monitoring_get_flush, mca_pml_monitoring_set_flush, - mca_pml_monitoring_notify_flush, &mca_pml_monitoring_component); - - (void)mca_base_pvar_register("ompi", "pml", "monitoring", "messages_count", "Number of messages " - "sent to each peer in a communicator", OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, - MCA_BASE_VAR_TYPE_UNSIGNED_LONG, NULL, MPI_T_BIND_MPI_COMM, - MCA_BASE_PVAR_FLAG_READONLY, - mca_pml_monitoring_get_messages_count, NULL, mca_pml_monitoring_messages_notify, NULL); - - (void)mca_base_pvar_register("ompi", "pml", "monitoring", "messages_size", "Size of messages " - "sent to each peer in a communicator", OPAL_INFO_LVL_4, MPI_T_PVAR_CLASS_SIZE, - MCA_BASE_VAR_TYPE_UNSIGNED_LONG, NULL, MPI_T_BIND_MPI_COMM, - MCA_BASE_PVAR_FLAG_READONLY, - mca_pml_monitoring_get_messages_size, NULL, mca_pml_monitoring_messages_notify, NULL); - return pml_selected_module.pml_enable(enable); } static int mca_pml_monitoring_component_open(void) { - if( mca_pml_monitoring_enabled ) { + /* CF: What if we are the only PML available ?? */ + if( mca_common_monitoring_enabled ) { opal_pointer_array_add(&mca_pml_base_pml, strdup(mca_pml_monitoring_component.pmlm_version.mca_component_name)); } @@ -139,22 +97,15 @@ static int mca_pml_monitoring_component_open(void) static int mca_pml_monitoring_component_close(void) { - if( NULL != mca_pml_monitoring_current_filename ) { - free(mca_pml_monitoring_current_filename); - mca_pml_monitoring_current_filename = NULL; - } - if( !mca_pml_monitoring_enabled ) - return OMPI_SUCCESS; + if( !mca_common_monitoring_enabled ) return OMPI_SUCCESS; /** - * If this component is already active, then we are currently monitoring the execution - * and this close if the one from MPI_Finalize. Do the clean up and release the extra - * reference on ourselves. 
+ * If this component is already active, then we are currently monitoring + * the execution and this call to close if the one from MPI_Finalize. + * Clean up and release the extra reference on ourselves. */ if( mca_pml_monitoring_active ) { /* Already active, turn off */ pml_selected_component.pmlm_version.mca_close_component(); - memset(&pml_selected_component, 0, sizeof(mca_pml_base_component_t)); - memset(&pml_selected_module, 0, sizeof(mca_pml_base_module_t)); mca_base_component_repository_release((mca_base_component_t*)&mca_pml_monitoring_component); mca_pml_monitoring_active = 0; return OMPI_SUCCESS; @@ -175,12 +126,13 @@ static int mca_pml_monitoring_component_close(void) pml_selected_module = mca_pml; /* Install our interception layer */ mca_pml_base_selected_component = mca_pml_monitoring_component; - mca_pml = mca_pml_monitoring; - /* Restore some of the original valued: progress, flags, tags and context id */ + mca_pml = mca_pml_monitoring_module; + /* Restore some of the original values: progress, flags, tags and context id */ mca_pml.pml_progress = pml_selected_module.pml_progress; mca_pml.pml_max_contextid = pml_selected_module.pml_max_contextid; mca_pml.pml_max_tag = pml_selected_module.pml_max_tag; - mca_pml.pml_flags = pml_selected_module.pml_flags; + /* Add MCA_PML_BASE_FLAG_REQUIRE_WORLD flag to ensure the hashtable is properly initialized */ + mca_pml.pml_flags = pml_selected_module.pml_flags | MCA_PML_BASE_FLAG_REQUIRE_WORLD; mca_pml_monitoring_active = 1; @@ -192,44 +144,36 @@ mca_pml_monitoring_component_init(int* priority, bool enable_progress_threads, bool enable_mpi_threads) { - if( mca_pml_monitoring_enabled ) { + mca_common_monitoring_init(); + if( mca_common_monitoring_enabled ) { *priority = 0; /* I'm up but don't select me */ - return &mca_pml_monitoring; + return &mca_pml_monitoring_module; } return NULL; } static int mca_pml_monitoring_component_finish(void) { - if( mca_pml_monitoring_enabled && mca_pml_monitoring_active ) { + if( mca_common_monitoring_enabled && mca_pml_monitoring_active ) { /* Free internal data structure */ - finalize_monitoring(); - /* Call the original PML and then close */ - mca_pml_monitoring_active = 0; - mca_pml_monitoring_enabled = 0; + mca_common_monitoring_finalize(); /* Restore the original PML */ mca_pml_base_selected_component = pml_selected_component; mca_pml = pml_selected_module; /* Redirect the close call to the original PML */ pml_selected_component.pmlm_finalize(); /** - * We should never release the last ref on the current component or face forever punishement. + * We should never release the last ref on the current + * component or face forever punishement. */ - /* mca_base_component_repository_release(&mca_pml_monitoring_component.pmlm_version); */ + /* mca_base_component_repository_release(&mca_common_monitoring_component.pmlm_version); */ } return OMPI_SUCCESS; } static int mca_pml_monitoring_component_register(void) { - (void)mca_base_component_var_register(&mca_pml_monitoring_component.pmlm_version, "enable", - "Enable the monitoring at the PML level. A value of 0 will disable the monitoring (default). " - "A value of 1 will aggregate all monitoring information (point-to-point and collective). 
" - "Any other value will enable filtered monitoring", - MCA_BASE_VAR_TYPE_INT, NULL, 0, 0, - OPAL_INFO_LVL_4, - MCA_BASE_VAR_SCOPE_READONLY, &mca_pml_monitoring_enabled); - + mca_common_monitoring_register(&mca_pml_monitoring_component); return OMPI_SUCCESS; } @@ -242,9 +186,7 @@ mca_pml_base_component_2_0_0_t mca_pml_monitoring_component = { MCA_PML_BASE_VERSION_2_0_0, .mca_component_name = "monitoring", /* MCA component name */ - .mca_component_major_version = OMPI_MAJOR_VERSION, /* MCA component major version */ - .mca_component_minor_version = OMPI_MINOR_VERSION, /* MCA component minor version */ - .mca_component_release_version = OMPI_RELEASE_VERSION, /* MCA component release version */ + MCA_MONITORING_MAKE_VERSION, .mca_open_component = mca_pml_monitoring_component_open, /* component open */ .mca_close_component = mca_pml_monitoring_component_close, /* component close */ .mca_register_component_params = mca_pml_monitoring_component_register @@ -256,6 +198,5 @@ mca_pml_base_component_2_0_0_t mca_pml_monitoring_component = { .pmlm_init = mca_pml_monitoring_component_init, /* component init */ .pmlm_finalize = mca_pml_monitoring_component_finish /* component finalize */ - }; diff --git a/ompi/mca/pml/monitoring/pml_monitoring_iprobe.c b/ompi/mca/pml/monitoring/pml_monitoring_iprobe.c index ec34cb5d27c..42bc7ba257c 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring_iprobe.c +++ b/ompi/mca/pml/monitoring/pml_monitoring_iprobe.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -11,7 +11,7 @@ */ #include -#include +#include "pml_monitoring.h" /* EJ: nothing to do here */ diff --git a/ompi/mca/pml/monitoring/pml_monitoring_irecv.c b/ompi/mca/pml/monitoring/pml_monitoring_irecv.c index 91b247c7c53..7c3fa8aa4d2 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring_irecv.c +++ b/ompi/mca/pml/monitoring/pml_monitoring_irecv.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -11,7 +11,7 @@ */ #include -#include +#include "pml_monitoring.h" /* EJ: loging is done on the sender. Nothing to do here */ diff --git a/ompi/mca/pml/monitoring/pml_monitoring_isend.c b/ompi/mca/pml/monitoring/pml_monitoring_isend.c index 1c88fd268bf..727a5dc30fd 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring_isend.c +++ b/ompi/mca/pml/monitoring/pml_monitoring_isend.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. 
* $COPYRIGHT$ * * Additional copyrights may follow @@ -11,9 +11,7 @@ */ #include -#include - -extern opal_hash_table_t *translation_ht; +#include "pml_monitoring.h" int mca_pml_monitoring_isend_init(const void *buf, size_t count, @@ -37,22 +35,16 @@ int mca_pml_monitoring_isend(const void *buf, struct ompi_communicator_t* comm, struct ompi_request_t **request) { - - /* find the processor of teh destination */ - ompi_proc_t *proc = ompi_group_get_proc_ptr(comm->c_remote_group, dst, true); int world_rank; - - /* find its name*/ - uint64_t key = *((uint64_t*)&(proc->super.proc_name)); /** * If this fails the destination is not part of my MPI_COM_WORLD * Lookup its name in the rank hastable to get its MPI_COMM_WORLD rank */ - if(OPAL_SUCCESS == opal_hash_table_get_value_uint64(translation_ht, key, (void *)&world_rank)) { + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(dst, comm, &world_rank)) { size_t type_size, data_size; ompi_datatype_type_size(datatype, &type_size); data_size = count*type_size; - monitor_send_data(world_rank, data_size, tag); + mca_common_monitoring_record_pml(world_rank, data_size, tag); } return pml_selected_module.pml_isend(buf, count, datatype, @@ -67,19 +59,15 @@ int mca_pml_monitoring_send(const void *buf, mca_pml_base_send_mode_t mode, struct ompi_communicator_t* comm) { - ompi_proc_t *proc = ompi_group_get_proc_ptr(comm->c_remote_group, dst, true); int world_rank; - uint64_t key = *((uint64_t*) &(proc->super.proc_name)); - /* Are we sending to a peer from my own MPI_COMM_WORLD? */ - if(OPAL_SUCCESS == opal_hash_table_get_value_uint64(translation_ht, key, (void *)&world_rank)) { + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(dst, comm, &world_rank)) { size_t type_size, data_size; ompi_datatype_type_size(datatype, &type_size); data_size = count*type_size; - monitor_send_data(world_rank, data_size, tag); + mca_common_monitoring_record_pml(world_rank, data_size, tag); } return pml_selected_module.pml_send(buf, count, datatype, dst, tag, mode, comm); } - diff --git a/ompi/mca/pml/monitoring/pml_monitoring_start.c b/ompi/mca/pml/monitoring/pml_monitoring_start.c index fbdebac1c27..17d91165d60 100644 --- a/ompi/mca/pml/monitoring/pml_monitoring_start.c +++ b/ompi/mca/pml/monitoring/pml_monitoring_start.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. 
* $COPYRIGHT$ * * Additional copyrights may follow @@ -11,12 +11,9 @@ */ #include -#include -#include +#include "pml_monitoring.h" #include -extern opal_hash_table_t *translation_ht; - /* manage persistant requests*/ int mca_pml_monitoring_start(size_t count, ompi_request_t** requests) @@ -25,7 +22,6 @@ int mca_pml_monitoring_start(size_t count, for( i = 0; i < count; i++ ) { mca_pml_base_request_t *pml_request = (mca_pml_base_request_t*)requests[i]; - ompi_proc_t *proc; int world_rank; if(NULL == pml_request) { @@ -38,18 +34,15 @@ int mca_pml_monitoring_start(size_t count, continue; } - proc = ompi_group_get_proc_ptr(pml_request->req_comm->c_remote_group, pml_request->req_peer, true); - uint64_t key = *((uint64_t*) &(proc->super.proc_name)); - - /** * If this fails the destination is not part of my MPI_COM_WORLD */ - if(OPAL_SUCCESS == opal_hash_table_get_value_uint64(translation_ht, key, (void *)&world_rank)) { + if(OPAL_SUCCESS == mca_common_monitoring_get_world_rank(pml_request->req_peer, + pml_request->req_comm, &world_rank)) { size_t type_size, data_size; ompi_datatype_type_size(pml_request->req_datatype, &type_size); data_size = pml_request->req_count * type_size; - monitor_send_data(world_rank, data_size, 1); + mca_common_monitoring_record_pml(world_rank, data_size, 1); } } return pml_selected_module.pml_start(count, requests); diff --git a/opal/mca/base/mca_base_pvar.c b/opal/mca/base/mca_base_pvar.c index 7decb8ab6f2..1c4f043ec76 100644 --- a/opal/mca/base/mca_base_pvar.c +++ b/opal/mca/base/mca_base_pvar.c @@ -719,6 +719,8 @@ int mca_base_pvar_handle_write_value (mca_base_pvar_handle_t *handle, const void } memmove (handle->current_value, value, handle->count * var_type_sizes[handle->pvar->type]); + /* read the value directly from the variable. */ + ret = handle->pvar->set_value (handle->pvar, value, handle->obj_handle); return OPAL_SUCCESS; } diff --git a/test/monitoring/Makefile.am b/test/monitoring/Makefile.am index 469c104ed2d..93ec737ea99 100644 --- a/test/monitoring/Makefile.am +++ b/test/monitoring/Makefile.am @@ -1,12 +1,12 @@ # -# Copyright (c) 2013-2015 The University of Tennessee and The University +# Copyright (c) 2013-2017 The University of Tennessee and The University # of Tennessee Research Foundation. All rights # reserved. -# Copyright (c) 2013-2015 Inria. All rights reserved. -# Copyright (c) 2015 Research Organization for Information Science +# Copyright (c) 2013-2017 Inria. All rights reserved. +# Copyright (c) 2015-2017 Research Organization for Information Science # and Technology (RIST). All rights reserved. # Copyright (c) 2016 IBM Corporation. All rights reserved. -# Copyright (c) 2016 Cisco Systems, Inc. All rights reserved. +# Copyright (c) 2016 Cisco Systems, Inc. All rights reserved. # $COPYRIGHT$ # # Additional copyrights may follow @@ -14,15 +14,37 @@ # $HEADER$ # +EXTRA_DIST = profile2mat.pl aggregate_profile.pl + # This test requires multiple processes to run. 
Don't run it as part # of 'make check' if PROJECT_OMPI - noinst_PROGRAMS = monitoring_test + noinst_PROGRAMS = monitoring_test test_pvar_access test_overhead check_monitoring example_reduce_count monitoring_test_SOURCES = monitoring_test.c monitoring_test_LDFLAGS = $(OMPI_PKG_CONFIG_LDFLAGS) monitoring_test_LDADD = \ $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \ $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la + test_pvar_access_SOURCES = test_pvar_access.c + test_pvar_access_LDFLAGS = $(OMPI_PKG_CONFIG_LDFLAGS) + test_pvar_access_LDADD = \ + $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \ + $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la + test_overhead_SOURCES = test_overhead.c + test_overhead_LDFLAGS = $(OMPI_PKG_CONFIG_LDFLAGS) + test_overhead_LDADD = \ + $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \ + $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la + check_monitoring_SOURCES = check_monitoring.c + check_monitoring_LDFLAGS = $(OMPI_PKG_CONFIG_LDFLAGS) + check_monitoring_LDADD = \ + $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \ + $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la + example_reduce_count_SOURCES = example_reduce_count.c + example_reduce_count_LDFLAGS = $(OMPI_PKG_CONFIG_LDFLAGS) + example_reduce_count_LDADD = \ + $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \ + $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la if MCA_BUILD_ompi_pml_monitoring_DSO lib_LTLIBRARIES = ompi_monitoring_prof.la @@ -34,4 +56,11 @@ if MCA_BUILD_ompi_pml_monitoring_DSO $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la endif # MCA_BUILD_ompi_pml_monitoring_DSO +if OPAL_INSTALL_BINARIES +bin_SCRIPTS = profile2mat.pl aggregate_profile.pl +endif # OPAL_INSTALL_BINARIES + endif # PROJECT_OMPI + +distclean: + rm -rf *.dSYM .deps .libs *.la *.lo monitoring_test test_pvar_access test_overhead check_monitoring example_reduce_count prof *.log *.o *.trs Makefile diff --git a/test/monitoring/aggregate_profile.pl b/test/monitoring/aggregate_profile.pl index da6d3780b00..2af537b5ae0 100644 --- a/test/monitoring/aggregate_profile.pl +++ b/test/monitoring/aggregate_profile.pl @@ -28,7 +28,7 @@ # ensure that this script as the executable right: chmod +x ... # -die "$0 \n\tProfile files should be of the form \"name_phaseid_processesid.prof\"\n\tFor instance if you saved the monitoring into phase_0_0.prof, phase_0_1.prof, ..., phase_1_0.prof etc you should call: $0 phase\n" if ($#ARGV!=0); +die "$0 \n\tProfile files should be of the form \"name_phaseid_processesid.prof\"\n\tFor instance if you saved the monitoring into phase_0.0.prof, phase_0.1.prof, ..., phase_1.0.prof etc you should call: $0 phase\n" if ($#ARGV!=0); $name = $ARGV[0]; @@ -39,7 +39,7 @@ # Detect the different phases foreach $file (@files) { - ($id)=($file =~ m/$name\_(\d+)_\d+/); + ($id)=($file =~ m/$name\_(\d+)\.\d+/); $phaseid{$id} = 1 if ($id); } diff --git a/test/monitoring/check_monitoring.c b/test/monitoring/check_monitoring.c new file mode 100644 index 00000000000..50c00769228 --- /dev/null +++ b/test/monitoring/check_monitoring.c @@ -0,0 +1,516 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * Copyright (c) 2017 The University of Tennessee and The University + * of Tennessee Research Foundation. All rights + * reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + Check the well working of the monitoring component for Open-MPI. 
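check_monitoring assumes that the monitoring performance variables are exposed through the MPI tool information interface. Independently of the handle-management macros that follow, the available variables can be listed at run time with the standard MPI_T introspection calls; a small standalone enumeration loop, illustrative only:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int provided, num, i;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num);
    for (i = 0; i < num; i++) {
        char name[256], desc[256];
        int nlen = sizeof(name), dlen = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;
        MPI_T_pvar_get_info(i, name, &nlen, &verbosity, &var_class, &datatype,
                            &enumtype, desc, &dlen, &bind, &readonly,
                            &continuous, &atomic);
        if (NULL != strstr(name, "monitoring"))  /* pml_/osc_/coll_monitoring_* */
            printf("pvar %d: %s (class %d)\n", i, name, var_class);
    }
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}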
+ + To be run as: + + mpirun -np 4 --mca pml_monitoring_enable 2 ./check_monitoring +*/ + +#include +#include +#include +#include + +#define PVAR_GENERATE_VARIABLES(pvar_prefix, pvar_name, pvar_class) \ + /* Variables */ \ + static MPI_T_pvar_handle pvar_prefix ## _handle; \ + static const char pvar_prefix ## _pvar_name[] = pvar_name; \ + static int pvar_prefix ## _pvar_idx; \ + /* Functions */ \ + static inline int pvar_prefix ## _start(MPI_T_pvar_session session) \ + { \ + int MPIT_result; \ + MPIT_result = MPI_T_pvar_start(session, pvar_prefix ## _handle); \ + if( MPI_SUCCESS != MPIT_result ) { \ + fprintf(stderr, "Failed to start handle on \"%s\" pvar, check that you have " \ + "enabled the monitoring component.\n", pvar_prefix ## _pvar_name); \ + MPI_Abort(MPI_COMM_WORLD, MPIT_result); \ + } \ + return MPIT_result; \ + } \ + static inline int pvar_prefix ## _init(MPI_T_pvar_session session) \ + { \ + int MPIT_result; \ + /* Get index */ \ + MPIT_result = MPI_T_pvar_get_index(pvar_prefix ## _pvar_name, \ + pvar_class, \ + &(pvar_prefix ## _pvar_idx)); \ + if( MPI_SUCCESS != MPIT_result ) { \ + fprintf(stderr, "Cannot find monitoring MPI_Tool \"%s\" pvar, check that you have " \ + "enabled the monitoring component.\n", pvar_prefix ## _pvar_name); \ + MPI_Abort(MPI_COMM_WORLD, MPIT_result); \ + return MPIT_result; \ + } \ + /* Allocate handle */ \ + /* Allocating a new PVAR in a session will reset the counters */ \ + int count; \ + MPIT_result = MPI_T_pvar_handle_alloc(session, pvar_prefix ## _pvar_idx, \ + MPI_COMM_WORLD, &(pvar_prefix ## _handle), \ + &count); \ + if( MPI_SUCCESS != MPIT_result ) { \ + fprintf(stderr, "Failed to allocate handle on \"%s\" pvar, check that you have " \ + "enabled the monitoring component.\n", pvar_prefix ## _pvar_name); \ + MPI_Abort(MPI_COMM_WORLD, MPIT_result); \ + return MPIT_result; \ + } \ + /* Start PVAR */ \ + return pvar_prefix ## _start(session); \ + } \ + static inline int pvar_prefix ## _stop(MPI_T_pvar_session session) \ + { \ + int MPIT_result; \ + MPIT_result = MPI_T_pvar_stop(session, pvar_prefix ## _handle); \ + if( MPI_SUCCESS != MPIT_result ) { \ + fprintf(stderr, "Failed to stop handle on \"%s\" pvar, check that you have " \ + "enabled the monitoring component.\n", pvar_prefix ## _pvar_name); \ + MPI_Abort(MPI_COMM_WORLD, MPIT_result); \ + } \ + return MPIT_result; \ + } \ + static inline int pvar_prefix ## _finalize(MPI_T_pvar_session session) \ + { \ + int MPIT_result; \ + /* Stop PVAR */ \ + MPIT_result = pvar_prefix ## _stop(session); \ + /* Free handle */ \ + MPIT_result = MPI_T_pvar_handle_free(session, &(pvar_prefix ## _handle)); \ + if( MPI_SUCCESS != MPIT_result ) { \ + fprintf(stderr, "Failed to allocate handle on \"%s\" pvar, check that you have " \ + "enabled the monitoring component.\n", pvar_prefix ## _pvar_name); \ + MPI_Abort(MPI_COMM_WORLD, MPIT_result); \ + return MPIT_result; \ + } \ + return MPIT_result; \ + } \ + static inline int pvar_prefix ## _read(MPI_T_pvar_session session, void*values) \ + { \ + int MPIT_result; \ + /* Stop pvar */ \ + MPIT_result = pvar_prefix ## _stop(session); \ + /* Read values */ \ + MPIT_result = MPI_T_pvar_read(session, pvar_prefix ## _handle, values); \ + if( MPI_SUCCESS != MPIT_result ) { \ + fprintf(stderr, "Failed to read handle on \"%s\" pvar, check that you have " \ + "enabled the monitoring component.\n", pvar_prefix ## _pvar_name); \ + MPI_Abort(MPI_COMM_WORLD, MPIT_result); \ + } \ + /* Start and return */ \ + return pvar_prefix ## _start(session); \ + } + +#define 
GENERATE_CS(prefix, pvar_name_prefix, pvar_class_c, pvar_class_s) \ + PVAR_GENERATE_VARIABLES(prefix ## _count, pvar_name_prefix "_count", pvar_class_c) \ + PVAR_GENERATE_VARIABLES(prefix ## _size, pvar_name_prefix "_size", pvar_class_s) \ + static inline int pvar_ ## prefix ## _init(MPI_T_pvar_session session) \ + { \ + prefix ## _count_init(session); \ + return prefix ## _size_init(session); \ + } \ + static inline int pvar_ ## prefix ## _finalize(MPI_T_pvar_session session) \ + { \ + prefix ## _count_finalize(session); \ + return prefix ## _size_finalize(session); \ + } \ + static inline void pvar_ ## prefix ## _read(MPI_T_pvar_session session, \ + size_t*cvalues, size_t*svalues) \ + { \ + /* Read count values */ \ + prefix ## _count_read(session, cvalues); \ + /* Read size values */ \ + prefix ## _size_read(session, svalues); \ + } + +GENERATE_CS(pml, "pml_monitoring_messages", MPI_T_PVAR_CLASS_SIZE, MPI_T_PVAR_CLASS_SIZE) +GENERATE_CS(osc_s, "osc_monitoring_messages_sent", MPI_T_PVAR_CLASS_SIZE, MPI_T_PVAR_CLASS_SIZE) +GENERATE_CS(osc_r, "osc_monitoring_messages_recv", MPI_T_PVAR_CLASS_SIZE, MPI_T_PVAR_CLASS_SIZE) +GENERATE_CS(coll, "coll_monitoring_messages", MPI_T_PVAR_CLASS_SIZE, MPI_T_PVAR_CLASS_SIZE) +GENERATE_CS(o2a, "coll_monitoring_o2a", MPI_T_PVAR_CLASS_COUNTER, MPI_T_PVAR_CLASS_AGGREGATE) +GENERATE_CS(a2o, "coll_monitoring_a2o", MPI_T_PVAR_CLASS_COUNTER, MPI_T_PVAR_CLASS_AGGREGATE) +GENERATE_CS(a2a, "coll_monitoring_a2a", MPI_T_PVAR_CLASS_COUNTER, MPI_T_PVAR_CLASS_AGGREGATE) + +static size_t *old_cvalues, *old_svalues; + +static inline void pvar_all_init(MPI_T_pvar_session*session, int world_size) +{ + int MPIT_result, provided; + MPIT_result = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); + if (MPIT_result != MPI_SUCCESS) { + fprintf(stderr, "Failed to initialiaze MPI_Tools sub-system.\n"); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_session_create(session); + if (MPIT_result != MPI_SUCCESS) { + printf("Failed to create a session for PVARs.\n"); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + old_cvalues = malloc(2 * world_size * sizeof(size_t)); + old_svalues = old_cvalues + world_size; + pvar_pml_init(*session); + pvar_osc_s_init(*session); + pvar_osc_r_init(*session); + pvar_coll_init(*session); + pvar_o2a_init(*session); + pvar_a2o_init(*session); + pvar_a2a_init(*session); +} + +static inline void pvar_all_finalize(MPI_T_pvar_session*session) +{ + int MPIT_result; + pvar_pml_finalize(*session); + pvar_osc_s_finalize(*session); + pvar_osc_r_finalize(*session); + pvar_coll_finalize(*session); + pvar_o2a_finalize(*session); + pvar_a2o_finalize(*session); + pvar_a2a_finalize(*session); + free(old_cvalues); + MPIT_result = MPI_T_pvar_session_free(session); + if (MPIT_result != MPI_SUCCESS) { + printf("Failed to close a session for PVARs.\n"); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + (void)MPI_T_finalize(); +} + +static inline int pvar_pml_check(MPI_T_pvar_session session, int world_size, int world_rank) +{ + int i, ret = MPI_SUCCESS; + size_t *cvalues, *svalues; + cvalues = malloc(2 * world_size * sizeof(size_t)); + svalues = cvalues + world_size; + /* Get values */ + pvar_pml_read(session, cvalues, svalues); + for( i = 0; i < world_size && MPI_SUCCESS == ret; ++i ) { + /* Check count values */ + if( i == world_rank && (cvalues[i] - old_cvalues[i]) != (size_t) 0 ) { + fprintf(stderr, "Error in %s: count_values[%d]=%zu, and should be equal to %zu.\n", + __func__, i, cvalues[i] - old_cvalues[i], (size_t) 0); + ret = -1; + } else if ( i 
!= world_rank && (cvalues[i] - old_cvalues[i]) < (size_t) world_size ) { + fprintf(stderr, "Error in %s: count_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, cvalues[i] - old_cvalues[i], (size_t) world_size); + ret = -1; + } + /* Check size values */ + if( i == world_rank && (svalues[i] - old_svalues[i]) != (size_t) 0 ) { + fprintf(stderr, "Error in %s: size_values[%d]=%zu, and should be equal to %zu.\n", + __func__, i, svalues[i] - old_svalues[i], (size_t) 0); + ret = -1; + } else if ( i != world_rank && (svalues[i] - old_svalues[i]) < (size_t) (world_size * 13 * sizeof(char)) ) { + fprintf(stderr, "Error in %s: size_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, svalues[i] - old_svalues[i], (size_t) (world_size * 13 * sizeof(char))); + ret = -1; + } + } + if( MPI_SUCCESS == ret ) { + fprintf(stdout, "Check PML...[ OK ]\n"); + } else { + fprintf(stdout, "Check PML...[FAIL]\n"); + } + /* Keep old PML values */ + memcpy(old_cvalues, cvalues, 2 * world_size * sizeof(size_t)); + /* Free arrays */ + free(cvalues); + return ret; +} + +static inline int pvar_osc_check(MPI_T_pvar_session session, int world_size, int world_rank) +{ + int i, ret = MPI_SUCCESS; + size_t *cvalues, *svalues; + cvalues = malloc(2 * world_size * sizeof(size_t)); + svalues = cvalues + world_size; + /* Get OSC values */ + memset(cvalues, 0, 2 * world_size * sizeof(size_t)); + /* Check OSC sent values */ + pvar_osc_s_read(session, cvalues, svalues); + for( i = 0; i < world_size && MPI_SUCCESS == ret; ++i ) { + /* Check count values */ + if( cvalues[i] < (size_t) world_size ) { + fprintf(stderr, "Error in %s: count_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, cvalues[i], (size_t) world_size); + ret = -1; + } + /* Check size values */ + if( svalues[i] < (size_t) (world_size * 13 * sizeof(char)) ) { + fprintf(stderr, "Error in %s: size_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, svalues[i], (size_t) (world_size * 13 * sizeof(char))); + ret = -1; + } + } + /* Check OSC received values */ + pvar_osc_r_read(session, cvalues, svalues); + for( i = 0; i < world_size && MPI_SUCCESS == ret; ++i ) { + /* Check count values */ + if( cvalues[i] < (size_t) world_size ) { + fprintf(stderr, "Error in %s: count_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, cvalues[i], (size_t) world_size); + ret = -1; + } + /* Check size values */ + if( svalues[i] < (size_t) (world_size * 13 * sizeof(char)) ) { + fprintf(stderr, "Error in %s: size_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, svalues[i], (size_t) (world_size * 13 * sizeof(char))); + ret = -1; + } + } + if( MPI_SUCCESS == ret ) { + fprintf(stdout, "Check OSC...[ OK ]\n"); + } else { + fprintf(stdout, "Check OSC...[FAIL]\n"); + } + /* Keep old PML values */ + memcpy(old_cvalues, cvalues, 2 * world_size * sizeof(size_t)); + /* Free arrays */ + free(cvalues); + return ret; +} + +static inline int pvar_coll_check(MPI_T_pvar_session session, int world_size, int world_rank) { + int i, ret = MPI_SUCCESS; + size_t count, size; + size_t *cvalues, *svalues; + cvalues = malloc(2 * world_size * sizeof(size_t)); + svalues = cvalues + world_size; + /* Get COLL values */ + pvar_coll_read(session, cvalues, svalues); + for( i = 0; i < world_size && MPI_SUCCESS == ret; ++i ) { + /* Check count values */ + if( i == world_rank && cvalues[i] != (size_t) 0 ) { + fprintf(stderr, "Error in %s: count_values[%d]=%zu, and should be equal to %zu.\n", + __func__, i, cvalues[i], (size_t) 0); + ret = -1; + } else if ( i != world_rank && cvalues[i] < 
(size_t) (world_size + 1) * 4 ) { + fprintf(stderr, "Error in %s: count_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, cvalues[i], (size_t) (world_size + 1) * 4); + ret = -1; + } + /* Check size values */ + if( i == world_rank && svalues[i] != (size_t) 0 ) { + fprintf(stderr, "Error in %s: size_values[%d]=%zu, and should be equal to %zu.\n", + __func__, i, svalues[i], (size_t) 0); + ret = -1; + } else if ( i != world_rank && svalues[i] < (size_t) (world_size * (2 * 13 * sizeof(char) + sizeof(int)) + 13 * 3 * sizeof(char) + sizeof(int)) ) { + fprintf(stderr, "Error in %s: size_values[%d]=%zu, and should be >= %zu.\n", + __func__, i, svalues[i], (size_t) (world_size * (2 * 13 * sizeof(char) + sizeof(int)) + 13 * 3 * sizeof(char) + sizeof(int))); + ret = -1; + } + } + /* Check One-to-all COLL values */ + pvar_o2a_read(session, &count, &size); + if( count < (size_t) 2 ) { + fprintf(stderr, "Error in %s: count_o2a=%zu, and should be >= %zu.\n", + __func__, count, (size_t) 2); + ret = -1; + } + if( size < (size_t) ((world_size - 1) * 13 * 2 * sizeof(char)) ) { + fprintf(stderr, "Error in %s: size_o2a=%zu, and should be >= %zu.\n", + __func__, size, (size_t) ((world_size - 1) * 13 * 2 * sizeof(char))); + ret = -1; + } + /* Check All-to-one COLL values */ + pvar_a2o_read(session, &count, &size); + if( count < (size_t) 2 ) { + fprintf(stderr, "Error in %s: count_a2o=%zu, and should be >= %zu.\n", + __func__, count, (size_t) 2); + ret = -1; + } + if( size < (size_t) ((world_size - 1) * (13 * sizeof(char) + sizeof(int))) ) { + fprintf(stderr, "Error in %s: size_a2o=%zu, and should be >= %zu.\n", + __func__, size, + (size_t) ((world_size - 1) * (13 * sizeof(char) + sizeof(int)))); + ret = -1; + } + /* Check All-to-all COLL values */ + pvar_a2a_read(session, &count, &size); + if( count < (size_t) (world_size * 4) ) { + fprintf(stderr, "Error in %s: count_a2a=%zu, and should be >= %zu.\n", + __func__, count, (size_t) (world_size * 4)); + ret = -1; + } + if( size < (size_t) (world_size * (world_size - 1) * (2 * 13 * sizeof(char) + sizeof(int))) ) { + fprintf(stderr, "Error in %s: size_a2a=%zu, and should be >= %zu.\n", + __func__, size, + (size_t) (world_size * (world_size - 1) * (2 * 13 * sizeof(char) + sizeof(int)))); + ret = -1; + } + if( MPI_SUCCESS == ret ) { + fprintf(stdout, "Check COLL...[ OK ]\n"); + } else { + fprintf(stdout, "Check COLL...[FAIL]\n"); + } + /* Keep old PML values */ + pvar_pml_read(session, old_cvalues, old_svalues); + /* Free arrays */ + free(cvalues); + return ret; +} + +int main(int argc, char* argv[]) +{ + int size, i, n, to, from, world_rank; + MPI_T_pvar_session session; + MPI_Status status; + char s1[20], s2[20]; + strncpy(s1, "hello world!", 13); + + MPI_Init(NULL, NULL); + MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); + MPI_Comm_size(MPI_COMM_WORLD, &size); + + pvar_all_init(&session, size); + + /* first phase: exchange size times data with everyone in + MPI_COMM_WORLD with collective operations. This phase comes + first in order to ease the prediction of messages exchanged of + each kind. 
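Concretely, each of the size iterations of the loop below issues eight collectives over MPI_COMM_WORLD (Allgather, Scatter, Allreduce, Alltoall, Bcast, Barrier, Gather, Reduce); the lower bounds verified in pvar_coll_check are derived from that fixed pattern.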
+ */ + char*coll_buff = malloc(2 * size * 13 * sizeof(char)); + char*coll_recv_buff = coll_buff + size * 13; + int sum_ranks; + for( n = 0; n < size; ++n ) { + /* Allgather */ + memset(coll_buff, 0, size * 13 * sizeof(char)); + MPI_Allgather(s1, 13, MPI_CHAR, coll_buff, 13, MPI_CHAR, MPI_COMM_WORLD); + for( i = 0; i < size; ++i ) { + if( strncmp(s1, &coll_buff[i * 13], 13) ) { + fprintf(stderr, "Error in Allgather check: received \"%s\" instead of " + "\"hello world!\" from %d.\n", &coll_buff[i * 13], i); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + /* Scatter */ + MPI_Scatter(coll_buff, 13, MPI_CHAR, s2, 13, MPI_CHAR, n, MPI_COMM_WORLD); + if( strncmp(s1, s2, 13) ) { + fprintf(stderr, "Error in Scatter check: received \"%s\" instead of " + "\"hello world!\" from %d.\n", s2, n); + MPI_Abort(MPI_COMM_WORLD, -1); + } + /* Allreduce */ + MPI_Allreduce(&world_rank, &sum_ranks, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); + if( sum_ranks != ((size - 1) * size / 2) ) { + fprintf(stderr, "Error in Allreduce check: sum_ranks=%d instead of %d.\n", + sum_ranks, (size - 1) * size / 2); + MPI_Abort(MPI_COMM_WORLD, -1); + } + /* Alltoall */ + memset(coll_recv_buff, 0, size * 13 * sizeof(char)); + MPI_Alltoall(coll_buff, 13, MPI_CHAR, coll_recv_buff, 13, MPI_CHAR, MPI_COMM_WORLD); + for( i = 0; i < size; ++i ) { + if( strncmp(s1, &coll_recv_buff[i * 13], 13) ) { + fprintf(stderr, "Error in Alltoall check: received \"%s\" instead of " + "\"hello world!\" from %d.\n", &coll_recv_buff[i * 13], i); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + /* Bcast */ + if( n == world_rank ) { + MPI_Bcast(s1, 13, MPI_CHAR, n, MPI_COMM_WORLD); + } else { + MPI_Bcast(s2, 13, MPI_CHAR, n, MPI_COMM_WORLD); + if( strncmp(s1, s2, 13) ) { + fprintf(stderr, "Error in Bcast check: received \"%s\" instead of " + "\"hello world!\" from %d.\n", s2, n); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + /* Barrier */ + MPI_Barrier(MPI_COMM_WORLD); + /* Gather */ + memset(coll_buff, 0, size * 13 * sizeof(char)); + MPI_Gather(s1, 13, MPI_CHAR, coll_buff, 13, MPI_CHAR, n, MPI_COMM_WORLD); + if( n == world_rank ) { + for( i = 0; i < size; ++i ) { + if( strncmp(s1, &coll_buff[i * 13], 13) ) { + fprintf(stderr, "Error in Gather check: received \"%s\" instead of " + "\"hello world!\" from %d.\n", &coll_buff[i * 13], i); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + } + /* Reduce */ + MPI_Reduce(&world_rank, &sum_ranks, 1, MPI_INT, MPI_SUM, n, MPI_COMM_WORLD); + if( n == world_rank ) { + if( sum_ranks != ((size - 1) * size / 2) ) { + fprintf(stderr, "Error in Reduce check: sum_ranks=%d instead of %d.\n", + sum_ranks, (size - 1) * size / 2); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + } + free(coll_buff); + if( -1 == pvar_coll_check(session, size, world_rank) ) MPI_Abort(MPI_COMM_WORLD, -1); + + /* second phase: exchange size times data with everyone except self + in MPI_COMM_WORLD with Send/Recv */ + for( n = 0; n < size; ++n ) { + for( i = 0; i < size - 1; ++i ) { + to = (world_rank+1+i)%size; + from = (world_rank+size-1-i)%size; + if(world_rank < to){ + MPI_Send(s1, 13, MPI_CHAR, to, world_rank, MPI_COMM_WORLD); + MPI_Recv(s2, 13, MPI_CHAR, from, from, MPI_COMM_WORLD, &status); + } else { + MPI_Recv(s2, 13, MPI_CHAR, from, from, MPI_COMM_WORLD, &status); + MPI_Send(s1, 13, MPI_CHAR, to, world_rank, MPI_COMM_WORLD); + } + if( strncmp(s2, "hello world!", 13) ) { + fprintf(stderr, "Error in PML check: s2=\"%s\" instead of \"hello world!\".\n", + s2); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + } + if( -1 == pvar_pml_check(session, size, world_rank) ) 
MPI_Abort(MPI_COMM_WORLD, -1); + + /* third phase: exchange size times data with everyone, including self, in + MPI_COMM_WORLD with RMA opertations */ + char win_buff[20]; + MPI_Win win; + MPI_Win_create(win_buff, 20, sizeof(char), MPI_INFO_NULL, MPI_COMM_WORLD, &win); + for( n = 0; n < size; ++n ) { + for( i = 0; i < size; ++i ) { + MPI_Win_lock(MPI_LOCK_EXCLUSIVE, i, 0, win); + MPI_Put(s1, 13, MPI_CHAR, i, 0, 13, MPI_CHAR, win); + MPI_Win_unlock(i, win); + } + MPI_Win_lock(MPI_LOCK_EXCLUSIVE, world_rank, 0, win); + if( strncmp(win_buff, "hello world!", 13) ) { + fprintf(stderr, "Error in OSC check: win_buff=\"%s\" instead of \"hello world!\".\n", + win_buff); + MPI_Abort(MPI_COMM_WORLD, -1); + } + MPI_Win_unlock(world_rank, win); + for( i = 0; i < size; ++i ) { + MPI_Win_lock(MPI_LOCK_EXCLUSIVE, i, 0, win); + MPI_Get(s2, 13, MPI_CHAR, i, 0, 13, MPI_CHAR, win); + MPI_Win_unlock(i, win); + if( strncmp(s2, "hello world!", 13) ) { + fprintf(stderr, "Error in OSC check: s2=\"%s\" instead of \"hello world!\".\n", + s2); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } + } + MPI_Win_free(&win); + if( -1 == pvar_osc_check(session, size, world_rank) ) MPI_Abort(MPI_COMM_WORLD, -1); + + pvar_all_finalize(&session); + + MPI_Finalize(); + + return EXIT_SUCCESS; +} diff --git a/test/monitoring/example_reduce_count.c b/test/monitoring/example_reduce_count.c new file mode 100644 index 00000000000..d7811d2bf08 --- /dev/null +++ b/test/monitoring/example_reduce_count.c @@ -0,0 +1,127 @@ +/* + * Copyright (c) 2017 Inria. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include +#include +#include + +static MPI_T_pvar_handle count_handle; +static const char count_pvar_name[] = "pml_monitoring_messages_count"; +static int count_pvar_idx; + +int main(int argc, char**argv) +{ + int rank, size, n, to, from, tagno, MPIT_result, provided, count; + MPI_T_pvar_session session; + MPI_Status status; + MPI_Request request; + size_t*counts; + + n = -1; + MPI_Init(&argc, &argv); + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Comm_size(MPI_COMM_WORLD, &size); + to = (rank + 1) % size; + from = (rank + size - 1) % size; + tagno = 201; + + MPIT_result = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); + if (MPIT_result != MPI_SUCCESS) + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + + MPIT_result = MPI_T_pvar_get_index(count_pvar_name, MPI_T_PVAR_CLASS_SIZE, &count_pvar_idx); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot find monitoring MPI_T \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_session_create(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot create a session for \"%s\" pvar\n", count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Allocating a new PVAR in a session will reset the counters */ + MPIT_result = MPI_T_pvar_handle_alloc(session, count_pvar_idx, + MPI_COMM_WORLD, &count_handle, &count); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to allocate handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + counts = (size_t*)malloc(count * sizeof(size_t)); + + MPIT_result = MPI_T_pvar_start(session, count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Token Ring communications */ + if (rank == 0) 
{ + n = 25; + MPI_Isend(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD,&request); + } + while (1) { + MPI_Irecv(&n, 1, MPI_INT, from, tagno, MPI_COMM_WORLD, &request); + MPI_Wait(&request, &status); + if (rank == 0) {n--;tagno++;} + MPI_Isend(&n, 1, MPI_INT, to, tagno, MPI_COMM_WORLD, &request); + if (rank != 0) {n--;tagno++;} + if (n<0){ + break; + } + } + + MPIT_result = MPI_T_pvar_read(session, count_handle, counts); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to read handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /*** REDUCE ***/ + MPI_Allreduce(MPI_IN_PLACE, counts, count, MPI_UNSIGNED_LONG, MPI_MAX, MPI_COMM_WORLD); + + if(0 == rank) { + for(n = 0; n < count; ++n) + printf("%zu%s", counts[n], n < count - 1 ? ", " : "\n"); + } + + free(counts); + + MPIT_result = MPI_T_pvar_stop(session, count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_handle_free(session, &count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to free handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_session_free(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot close a session for \"%s\" pvar\n", count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + (void)MPI_T_finalize(); + + MPI_Finalize(); + + return EXIT_SUCCESS; +} diff --git a/test/monitoring/monitoring_prof.c b/test/monitoring/monitoring_prof.c index 30c7824e848..3585c4927cf 100644 --- a/test/monitoring/monitoring_prof.c +++ b/test/monitoring/monitoring_prof.c @@ -1,10 +1,12 @@ /* - * Copyright (c) 2013-2016 The University of Tennessee and The University + * Copyright (c) 2013-2017 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * Copyright (c) 2013-2015 Bull SAS. All rights reserved. - * Copyright (c) 2016 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2016 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -19,6 +21,7 @@ Designed by: George Bosilca Emmanuel Jeannot Guillaume Papauré + Clément Foyer Contact the authors for questions. 
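In the interposed MPI_Finalize() below, each per-process monitoring vector is gathered into a square matrix on rank 0 and then symmetrized, reproducing the post-processing that profile2mat.pl applies to the text profiles. The pairwise reduction, extracted here as a standalone sketch with a hypothetical helper name:

#include <stddef.h>

/* Hypothetical helper mirroring the pairwise reduction done on rank 0. */
static void symmetrize_and_average(size_t *count, size_t *size, size_t *avg, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            /* Average the two directions of each (i, j) pair... */
            size_t c = (count[i * n + j] + count[j * n + i]) / 2;
            size_t s = (size[i * n + j] + size[j * n + i]) / 2;
            count[i * n + j] = count[j * n + i] = c;
            size[i * n + j]  = size[j * n + i]  = s;
            /* ...and derive the average message size where it is defined. */
            if (0 != c)
                avg[i * n + j] = avg[j * n + i] = s / c;
        }
    }
}

The same reduction shape is applied below to the PML, COLL and OSC matrices before the aggregated monitoring_all_* matrices are written.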
@@ -43,8 +46,6 @@ writing 4x4 matrix to monitoring_avg.mat #include #include #include -#include -#include static MPI_T_pvar_session session; static int comm_world_size; @@ -55,14 +56,24 @@ struct monitoring_result char * pvar_name; int pvar_idx; MPI_T_pvar_handle pvar_handle; - uint64_t * vector; + size_t * vector; }; typedef struct monitoring_result monitoring_result; -static monitoring_result counts; -static monitoring_result sizes; - -static int write_mat(char *, uint64_t *, unsigned int); +/* PML Sent */ +static monitoring_result pml_counts; +static monitoring_result pml_sizes; +/* OSC Sent */ +static monitoring_result osc_scounts; +static monitoring_result osc_ssizes; +/* OSC Recv */ +static monitoring_result osc_rcounts; +static monitoring_result osc_rsizes; +/* COLL Sent/Recv */ +static monitoring_result coll_counts; +static monitoring_result coll_sizes; + +static int write_mat(char *, size_t *, unsigned int); static void init_monitoring_result(const char *, monitoring_result *); static void start_monitoring_result(monitoring_result *); static void stop_monitoring_result(monitoring_result *); @@ -91,11 +102,23 @@ int MPI_Init(int* argc, char*** argv) PMPI_Abort(MPI_COMM_WORLD, MPIT_result); } - init_monitoring_result("pml_monitoring_messages_count", &counts); - init_monitoring_result("pml_monitoring_messages_size", &sizes); - - start_monitoring_result(&counts); - start_monitoring_result(&sizes); + init_monitoring_result("pml_monitoring_messages_count", &pml_counts); + init_monitoring_result("pml_monitoring_messages_size", &pml_sizes); + init_monitoring_result("osc_monitoring_messages_sent_count", &osc_scounts); + init_monitoring_result("osc_monitoring_messages_sent_size", &osc_ssizes); + init_monitoring_result("osc_monitoring_messages_recv_count", &osc_rcounts); + init_monitoring_result("osc_monitoring_messages_recv_size", &osc_rsizes); + init_monitoring_result("coll_monitoring_messages_count", &coll_counts); + init_monitoring_result("coll_monitoring_messages_size", &coll_sizes); + + start_monitoring_result(&pml_counts); + start_monitoring_result(&pml_sizes); + start_monitoring_result(&osc_scounts); + start_monitoring_result(&osc_ssizes); + start_monitoring_result(&osc_rcounts); + start_monitoring_result(&osc_rsizes); + start_monitoring_result(&coll_counts); + start_monitoring_result(&coll_sizes); return result; } @@ -103,48 +126,143 @@ int MPI_Init(int* argc, char*** argv) int MPI_Finalize(void) { int result, MPIT_result; - uint64_t * exchange_count_matrix = NULL; - uint64_t * exchange_size_matrix = NULL; - uint64_t * exchange_avg_size_matrix = NULL; + size_t * exchange_count_matrix_1 = NULL; + size_t * exchange_size_matrix_1 = NULL; + size_t * exchange_count_matrix_2 = NULL; + size_t * exchange_size_matrix_2 = NULL; + size_t * exchange_all_size_matrix = NULL; + size_t * exchange_all_count_matrix = NULL; + size_t * exchange_all_avg_matrix = NULL; + + stop_monitoring_result(&pml_counts); + stop_monitoring_result(&pml_sizes); + stop_monitoring_result(&osc_scounts); + stop_monitoring_result(&osc_ssizes); + stop_monitoring_result(&osc_rcounts); + stop_monitoring_result(&osc_rsizes); + stop_monitoring_result(&coll_counts); + stop_monitoring_result(&coll_sizes); + + get_monitoring_result(&pml_counts); + get_monitoring_result(&pml_sizes); + get_monitoring_result(&osc_scounts); + get_monitoring_result(&osc_ssizes); + get_monitoring_result(&osc_rcounts); + get_monitoring_result(&osc_rsizes); + get_monitoring_result(&coll_counts); + get_monitoring_result(&coll_sizes); if (0 == 
comm_world_rank) { - exchange_count_matrix = (uint64_t *) malloc(comm_world_size * comm_world_size * sizeof(uint64_t)); - exchange_size_matrix = (uint64_t *) malloc(comm_world_size * comm_world_size * sizeof(uint64_t)); - exchange_avg_size_matrix = (uint64_t *) malloc(comm_world_size * comm_world_size * sizeof(uint64_t)); + exchange_count_matrix_1 = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); + exchange_size_matrix_1 = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); + exchange_count_matrix_2 = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); + exchange_size_matrix_2 = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); + exchange_all_size_matrix = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); + exchange_all_count_matrix = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); + exchange_all_avg_matrix = (size_t *) calloc(comm_world_size * comm_world_size, sizeof(size_t)); } - stop_monitoring_result(&counts); - stop_monitoring_result(&sizes); + /* Gather PML and COLL results */ + PMPI_Gather(pml_counts.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_count_matrix_1, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + PMPI_Gather(pml_sizes.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_size_matrix_1, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + PMPI_Gather(coll_counts.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_count_matrix_2, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + PMPI_Gather(coll_sizes.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_size_matrix_2, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + + if (0 == comm_world_rank) { + int i, j; - get_monitoring_result(&counts); - get_monitoring_result(&sizes); + for (i = 0; i < comm_world_size; ++i) { + for (j = i + 1; j < comm_world_size; ++j) { + /* Reduce PML results */ + exchange_count_matrix_1[i * comm_world_size + j] = exchange_count_matrix_1[j * comm_world_size + i] = (exchange_count_matrix_1[i * comm_world_size + j] + exchange_count_matrix_1[j * comm_world_size + i]) / 2; + exchange_size_matrix_1[i * comm_world_size + j] = exchange_size_matrix_1[j * comm_world_size + i] = (exchange_size_matrix_1[i * comm_world_size + j] + exchange_size_matrix_1[j * comm_world_size + i]) / 2; + if (exchange_count_matrix_1[i * comm_world_size + j] != 0) + exchange_all_size_matrix[i * comm_world_size + j] = exchange_all_size_matrix[j * comm_world_size + i] = exchange_size_matrix_1[i * comm_world_size + j] / exchange_count_matrix_1[i * comm_world_size + j]; + + /* Reduce COLL results */ + exchange_count_matrix_2[i * comm_world_size + j] = exchange_count_matrix_2[j * comm_world_size + i] = (exchange_count_matrix_2[i * comm_world_size + j] + exchange_count_matrix_2[j * comm_world_size + i]) / 2; + exchange_size_matrix_2[i * comm_world_size + j] = exchange_size_matrix_2[j * comm_world_size + i] = (exchange_size_matrix_2[i * comm_world_size + j] + exchange_size_matrix_2[j * comm_world_size + i]) / 2; + if (exchange_count_matrix_2[i * comm_world_size + j] != 0) + exchange_all_count_matrix[i * comm_world_size + j] = exchange_all_count_matrix[j * comm_world_size + i] = exchange_size_matrix_2[i * comm_world_size + j] / exchange_count_matrix_2[i * comm_world_size + j]; + } + } + + /* Write PML matrices */ + write_mat("monitoring_pml_msg.mat", exchange_count_matrix_1, comm_world_size); + write_mat("monitoring_pml_size.mat", exchange_size_matrix_1, comm_world_size); + 
write_mat("monitoring_pml_avg.mat", exchange_all_size_matrix, comm_world_size); + + /* Write COLL matrices */ + write_mat("monitoring_coll_msg.mat", exchange_count_matrix_2, comm_world_size); + write_mat("monitoring_coll_size.mat", exchange_size_matrix_2, comm_world_size); + write_mat("monitoring_coll_avg.mat", exchange_all_count_matrix, comm_world_size); + + /* Aggregate PML and COLL in ALL matrices */ + for (i = 0; i < comm_world_size; ++i) { + for (j = i + 1; j < comm_world_size; ++j) { + exchange_all_size_matrix[i * comm_world_size + j] = exchange_all_size_matrix[j * comm_world_size + i] = exchange_size_matrix_1[i * comm_world_size + j] + exchange_size_matrix_2[i * comm_world_size + j]; + exchange_all_count_matrix[i * comm_world_size + j] = exchange_all_count_matrix[j * comm_world_size + i] = exchange_count_matrix_1[i * comm_world_size + j] + exchange_count_matrix_2[i * comm_world_size + j]; + } + } + } - PMPI_Gather(counts.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_count_matrix, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); - PMPI_Gather(sizes.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_size_matrix, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + /* Gather OSC results */ + PMPI_Gather(osc_scounts.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_count_matrix_1, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + PMPI_Gather(osc_ssizes.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_size_matrix_1, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + PMPI_Gather(osc_rcounts.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_count_matrix_2, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); + PMPI_Gather(osc_rsizes.vector, comm_world_size, MPI_UNSIGNED_LONG, exchange_size_matrix_2, comm_world_size, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD); if (0 == comm_world_rank) { int i, j; - //Get the same matrix than profile2mat.pl for (i = 0; i < comm_world_size; ++i) { for (j = i + 1; j < comm_world_size; ++j) { - exchange_count_matrix[i * comm_world_size + j] = exchange_count_matrix[j * comm_world_size + i] = (exchange_count_matrix[i * comm_world_size + j] + exchange_count_matrix[j * comm_world_size + i]) / 2; - exchange_size_matrix[i * comm_world_size + j] = exchange_size_matrix[j * comm_world_size + i] = (exchange_size_matrix[i * comm_world_size + j] + exchange_size_matrix[j * comm_world_size + i]) / 2; - if (exchange_count_matrix[i * comm_world_size + j] != 0) - exchange_avg_size_matrix[i * comm_world_size + j] = exchange_avg_size_matrix[j * comm_world_size + i] = exchange_size_matrix[i * comm_world_size + j] / exchange_count_matrix[i * comm_world_size + j]; + /* Reduce OSC results */ + exchange_count_matrix_1[i * comm_world_size + j] = exchange_count_matrix_1[j * comm_world_size + i] = (exchange_count_matrix_1[i * comm_world_size + j] + exchange_count_matrix_1[j * comm_world_size + i] + exchange_count_matrix_2[i * comm_world_size + j] + exchange_count_matrix_2[j * comm_world_size + i]) / 2; + exchange_size_matrix_1[i * comm_world_size + j] = exchange_size_matrix_1[j * comm_world_size + i] = (exchange_size_matrix_1[i * comm_world_size + j] + exchange_size_matrix_1[j * comm_world_size + i] + exchange_size_matrix_2[i * comm_world_size + j] + exchange_size_matrix_2[j * comm_world_size + i]) / 2; + if (exchange_count_matrix_1[i * comm_world_size + j] != 0) + exchange_all_avg_matrix[i * comm_world_size + j] = exchange_all_avg_matrix[j * comm_world_size + i] = exchange_size_matrix_1[i * comm_world_size + j] / exchange_count_matrix_1[i * 
comm_world_size + j]; } } - write_mat("monitoring_msg.mat", exchange_count_matrix, comm_world_size); - write_mat("monitoring_size.mat", exchange_size_matrix, comm_world_size); - write_mat("monitoring_avg.mat", exchange_avg_size_matrix, comm_world_size); + /* Write OSC matrices */ + write_mat("monitoring_osc_msg.mat", exchange_count_matrix_1, comm_world_size); + write_mat("monitoring_osc_size.mat", exchange_size_matrix_1, comm_world_size); + write_mat("monitoring_osc_avg.mat", exchange_all_avg_matrix, comm_world_size); + + /* Aggregate OSC in ALL matrices and compute AVG */ + for (i = 0; i < comm_world_size; ++i) { + for (j = i + 1; j < comm_world_size; ++j) { + exchange_all_size_matrix[i * comm_world_size + j] = exchange_all_size_matrix[j * comm_world_size + i] += exchange_size_matrix_1[i * comm_world_size + j]; + exchange_all_count_matrix[i * comm_world_size + j] = exchange_all_count_matrix[j * comm_world_size + i] += exchange_count_matrix_1[i * comm_world_size + j]; + if (exchange_all_count_matrix[i * comm_world_size + j] != 0) + exchange_all_avg_matrix[i * comm_world_size + j] = exchange_all_avg_matrix[j * comm_world_size + i] = exchange_all_size_matrix[i * comm_world_size + j] / exchange_all_count_matrix[i * comm_world_size + j]; + } + } + + /* Write ALL matrices */ + write_mat("monitoring_all_msg.mat", exchange_all_count_matrix, comm_world_size); + write_mat("monitoring_all_size.mat", exchange_all_size_matrix, comm_world_size); + write_mat("monitoring_all_avg.mat", exchange_all_avg_matrix, comm_world_size); + + /* Free matrices */ + free(exchange_count_matrix_1); + free(exchange_size_matrix_1); + free(exchange_count_matrix_2); + free(exchange_size_matrix_2); + free(exchange_all_count_matrix); + free(exchange_all_size_matrix); + free(exchange_all_avg_matrix); } - free(exchange_count_matrix); - free(exchange_size_matrix); - free(exchange_avg_size_matrix); - destroy_monitoring_result(&counts); - destroy_monitoring_result(&sizes); + destroy_monitoring_result(&pml_counts); + destroy_monitoring_result(&pml_sizes); + destroy_monitoring_result(&osc_scounts); + destroy_monitoring_result(&osc_ssizes); + destroy_monitoring_result(&osc_rcounts); + destroy_monitoring_result(&osc_rsizes); + destroy_monitoring_result(&coll_counts); + destroy_monitoring_result(&coll_sizes); MPIT_result = MPI_T_pvar_session_free(&session); if (MPIT_result != MPI_SUCCESS) { @@ -186,7 +304,7 @@ void init_monitoring_result(const char * pvar_name, monitoring_result * res) PMPI_Abort(MPI_COMM_WORLD, count); } - res->vector = (uint64_t *) malloc(comm_world_size * sizeof(uint64_t)); + res->vector = (size_t *) malloc(comm_world_size * sizeof(size_t)); } void start_monitoring_result(monitoring_result * res) @@ -236,7 +354,7 @@ void destroy_monitoring_result(monitoring_result * res) free(res->vector); } -int write_mat(char * filename, uint64_t * mat, unsigned int dim) +int write_mat(char * filename, size_t * mat, unsigned int dim) { FILE *matrix_file; int i, j; @@ -251,7 +369,7 @@ int write_mat(char * filename, uint64_t * mat, unsigned int dim) for (i = 0; i < comm_world_size; ++i) { for (j = 0; j < comm_world_size; ++j) { - fprintf(matrix_file, "%" PRIu64 " ", mat[i * comm_world_size + j]); + fprintf(matrix_file, "%zu ", mat[i * comm_world_size + j]); } fprintf(matrix_file, "\n"); } @@ -260,3 +378,67 @@ int write_mat(char * filename, uint64_t * mat, unsigned int dim) return 0; } + +/** + * MPI binding for fortran + */ + +#include +#include "ompi_config.h" +#include "opal/threads/thread_usage.h" +#include 
"ompi/mpi/fortran/base/constants.h" +#include "ompi/mpi/fortran/base/fint_2_int.h" + +void monitoring_prof_mpi_init_f2c( MPI_Fint * ); +void monitoring_prof_mpi_finalize_f2c( MPI_Fint * ); + +void monitoring_prof_mpi_init_f2c( MPI_Fint *ierr ) { + int c_ierr; + int argc = 0; + char ** argv = NULL; + + c_ierr = MPI_Init(&argc, &argv); + if (NULL != ierr) *ierr = OMPI_INT_2_FINT(c_ierr); +} + +void monitoring_prof_mpi_finalize_f2c( MPI_Fint *ierr ) { + int c_ierr; + + c_ierr = MPI_Finalize(); + if (NULL != ierr) *ierr = OMPI_INT_2_FINT(c_ierr); +} + +#if OPAL_HAVE_WEAK_SYMBOLS +#pragma weak MPI_INIT = monitoring_prof_mpi_init_f2c +#pragma weak mpi_init = monitoring_prof_mpi_init_f2c +#pragma weak mpi_init_ = monitoring_prof_mpi_init_f2c +#pragma weak mpi_init__ = monitoring_prof_mpi_init_f2c +#pragma weak MPI_Init_f = monitoring_prof_mpi_init_f2c +#pragma weak MPI_Init_f08 = monitoring_prof_mpi_init_f2c + +#pragma weak MPI_FINALIZE = monitoring_prof_mpi_finalize_f2c +#pragma weak mpi_finalize = monitoring_prof_mpi_finalize_f2c +#pragma weak mpi_finalize_ = monitoring_prof_mpi_finalize_f2c +#pragma weak mpi_finalize__ = monitoring_prof_mpi_finalize_f2c +#pragma weak MPI_Finalize_f = monitoring_prof_mpi_finalize_f2c +#pragma weak MPI_Finalize_f08 = monitoring_prof_mpi_finalize_f2c +#elif OMPI_BUILD_FORTRAN_BINDINGS +#define OMPI_F77_PROTOTYPES_MPI_H +#include "ompi/mpi/fortran/mpif-h/bindings.h" + +OMPI_GENERATE_F77_BINDINGS (MPI_INIT, + mpi_init, + mpi_init_, + mpi_init__, + monitoring_prof_mpi_init_f2c, + (MPI_Fint *ierr), + (ierr) ) + +OMPI_GENERATE_F77_BINDINGS (MPI_FINALIZE, + mpi_finalize, + mpi_finalize_, + mpi_finalize__, + monitoring_prof_mpi_finalize_f2c, + (MPI_Fint *ierr), + (ierr) ) +#endif diff --git a/test/monitoring/monitoring_test.c b/test/monitoring/monitoring_test.c index 70d51d17c29..f3616ab7908 100644 --- a/test/monitoring/monitoring_test.c +++ b/test/monitoring/monitoring_test.c @@ -2,7 +2,7 @@ * Copyright (c) 2013-2015 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. - * Copyright (c) 2013-2015 Inria. All rights reserved. + * Copyright (c) 2013-2017 Inria. All rights reserved. * Copyright (c) 2015 Cisco Systems, Inc. All rights reserved. * Copyright (c) 2016 Intel, Inc. All rights reserved. * $COPYRIGHT$ @@ -15,243 +15,362 @@ /* pml monitoring tester. -Designed by George Bosilca and Emmanuel Jeannot +Designed by George Bosilca Emmanuel Jeannot and Clément Foyer Contact the authors for questions. 
-To be run as: - -mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test -pm -Then, the output should be: - -flushing to ./prof/phase_1_2.prof -flushing to ./prof/phase_1_0.prof -flushing to ./prof/phase_1_3.prof -flushing to ./prof/phase_2_1.prof -flushing to ./prof/phase_2_3.prof -flushing to ./prof/phase_2_0.prof -flushing to ./prof/phase_2_2.prof -I 0 1 108 bytes 27 msgs sent -E 0 1 1012 bytes 30 msgs sent -E 0 2 23052 bytes 61 msgs sent -I 1 2 104 bytes 26 msgs sent -I 1 3 208 bytes 52 msgs sent -E 1 0 860 bytes 24 msgs sent -E 1 3 2552 bytes 56 msgs sent -I 2 3 104 bytes 26 msgs sent -E 2 0 22804 bytes 49 msgs sent -E 2 3 860 bytes 24 msgs sent -I 3 0 104 bytes 26 msgs sent -I 3 1 204 bytes 51 msgs sent -E 3 1 2304 bytes 44 msgs sent -E 3 2 860 bytes 24 msgs sent - -or as - -mpirun -np 4 --mca pml_monitoring_enable 1 ./monitoring_test - -for an output as: - -flushing to ./prof/phase_1_1.prof -flushing to ./prof/phase_1_0.prof -flushing to ./prof/phase_1_2.prof -flushing to ./prof/phase_1_3.prof -flushing to ./prof/phase_2_1.prof -flushing to ./prof/phase_2_3.prof -flushing to ./prof/phase_2_2.prof -flushing to ./prof/phase_2_0.prof -I 0 1 1120 bytes 57 msgs sent -I 0 2 23052 bytes 61 msgs sent -I 1 0 860 bytes 24 msgs sent -I 1 2 104 bytes 26 msgs sent -I 1 3 2760 bytes 108 msgs sent -I 2 0 22804 bytes 49 msgs sent -I 2 3 964 bytes 50 msgs sent -I 3 0 104 bytes 26 msgs sent -I 3 1 2508 bytes 95 msgs sent -I 3 2 860 bytes 24 msgs sent -*/ +To options are available for this test, with/without MPI_Tools, and with/without RMA operations. The default mode is without MPI_Tools, and with RMA operations. +To enable the MPI_Tools use, add "--with-mpit" as an application parameter. +To disable the RMA operations testing, add "--without-rma" as an application parameter. 
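At the heart of the rewritten test below is the "pml_monitoring_flush" performance variable: writing a string to it sets the output prefix (the file becomes <prefix>.<rank>.prof), stopping the handle forces the dump, and writing a pointer to NULL clears the prefix so a later stop produces no file. A condensed, standalone version of that sequence wrapped around a single phase (error checks omitted; the real test verifies every MPIT_result):

#include <mpi.h>

/* Condensed flush sequence, illustrative only.
 * Run as: mpirun -np 4 --mca pml_monitoring_enable 2 ./a.out
 * The prof/ directory must exist beforehand. */
int main(int argc, char *argv[])
{
    MPI_T_pvar_session session;
    MPI_T_pvar_handle handle;
    int provided, idx, count, rank;
    const void *nullbuf = NULL;
    char prefix[] = "prof/phase_1";

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_index("pml_monitoring_flush", MPI_T_PVAR_CLASS_GENERIC, &idx);
    MPI_T_pvar_session_create(&session);
    /* Allocating the handle resets the counters, so do it before the phase. */
    MPI_T_pvar_handle_alloc(session, idx, MPI_COMM_WORLD, &handle, &count);
    MPI_T_pvar_start(session, handle);

    /* ... the communication phase to be profiled goes here ... */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Bcast(&rank, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_T_pvar_write(session, handle, prefix);            /* set the output prefix    */
    MPI_T_pvar_stop(session, handle);                     /* stopping forces the dump */
    MPI_T_pvar_start(session, handle);
    MPI_T_pvar_write(session, handle, (void *)&nullbuf);  /* clear the filename again */
    MPI_T_pvar_stop(session, handle);

    MPI_T_pvar_handle_free(session, &handle);
    MPI_T_pvar_session_free(&session);
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}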
+ +To be run as (without using MPI_Tool): + +mpirun -np 4 --mca pml_monitoring_enable 2 --mca pml_monitoring_enable_output 3 --mca pml_monitoring_filename prof/output ./monitoring_test + +with the results being, as an example: +output.1.prof +# POINT TO POINT +E 1 2 104 bytes 26 msgs sent 0,0,0,26,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +E 1 3 208 bytes 52 msgs sent 8,0,0,65,1,5,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +I 1 0 140 bytes 27 msgs sent +I 1 2 2068 bytes 1 msgs sent +I 1 3 2256 bytes 31 msgs sent +# OSC +S 1 0 0 bytes 1 msgs sent +R 1 0 40960 bytes 1 msgs sent +S 1 2 40960 bytes 1 msgs sent +# COLLECTIVES +C 1 0 140 bytes 27 msgs sent +C 1 2 140 bytes 27 msgs sent +C 1 3 140 bytes 27 msgs sent +D MPI COMMUNICATOR 4 DUP FROM 0 procs: 0,1,2,3 +O2A 1 0 bytes 0 msgs sent +A2O 1 0 bytes 0 msgs sent +A2A 1 276 bytes 15 msgs sent +D MPI_COMM_WORLD procs: 0,1,2,3 +O2A 1 0 bytes 0 msgs sent +A2O 1 0 bytes 0 msgs sent +A2A 1 96 bytes 9 msgs sent +D MPI COMMUNICATOR 5 SPLIT_TYPE FROM 4 procs: 0,1,2,3 +O2A 1 0 bytes 0 msgs sent +A2O 1 0 bytes 0 msgs sent +A2A 1 48 bytes 3 msgs sent +D MPI COMMUNICATOR 3 SPLIT FROM 0 procs: 1,3 +O2A 1 0 bytes 0 msgs sent +A2O 1 0 bytes 0 msgs sent +A2A 1 0 bytes 0 msgs sent +*/ -#include #include "mpi.h" +#include +#include static MPI_T_pvar_handle flush_handle; static const char flush_pvar_name[] = "pml_monitoring_flush"; +static const void*nullbuf = NULL; static int flush_pvar_idx; +static int with_mpit = 0; +static int with_rma = 1; int main(int argc, char* argv[]) { - int rank, size, n, to, from, tagno, MPIT_result, provided, count; + int rank, size, n, to, from, tagno, MPIT_result, provided, count, world_rank; MPI_T_pvar_session session; - MPI_Status status; MPI_Comm newcomm; - MPI_Request request; char filename[1024]; - + + for ( int arg_it = 1; argc > 1 && arg_it < argc; ++arg_it ) { + if( 0 == strcmp(argv[arg_it], "--with-mpit") ) { + with_mpit = 1; + printf("enable MPIT support\n"); + } else if( 0 == strcmp(argv[arg_it], "--without-rma") ) { + with_rma = 0; + printf("disable RMA testing\n"); + } + } /* first phase : make a token circulated in MPI_COMM_WORLD */ n = -1; - MPI_Init(&argc, &argv); - MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Init(NULL, NULL); + MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); MPI_Comm_size(MPI_COMM_WORLD, &size); + rank = world_rank; to = (rank + 1) % size; from = (rank - 1) % size; tagno = 201; - MPIT_result = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); - if (MPIT_result != MPI_SUCCESS) - MPI_Abort(MPI_COMM_WORLD, MPIT_result); + if( with_mpit ) { + MPIT_result = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); + if (MPIT_result != MPI_SUCCESS) + MPI_Abort(MPI_COMM_WORLD, MPIT_result); - MPIT_result = MPI_T_pvar_get_index(flush_pvar_name, MPI_T_PVAR_CLASS_GENERIC, &flush_pvar_idx); - if (MPIT_result != MPI_SUCCESS) { - printf("cannot find monitoring MPI_T \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); - } + MPIT_result = MPI_T_pvar_get_index(flush_pvar_name, MPI_T_PVAR_CLASS_GENERIC, &flush_pvar_idx); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot find monitoring MPI_T \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } - MPIT_result = MPI_T_pvar_session_create(&session); - if (MPIT_result != MPI_SUCCESS) { - 
printf("cannot create a session for \"%s\" pvar\n", flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); - } + MPIT_result = MPI_T_pvar_session_create(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot create a session for \"%s\" pvar\n", flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } - /* Allocating a new PVAR in a session will reset the counters */ - MPIT_result = MPI_T_pvar_handle_alloc(session, flush_pvar_idx, - MPI_COMM_WORLD, &flush_handle, &count); - if (MPIT_result != MPI_SUCCESS) { - printf("failed to allocate handle on \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); - } + /* Allocating a new PVAR in a session will reset the counters */ + MPIT_result = MPI_T_pvar_handle_alloc(session, flush_pvar_idx, + MPI_COMM_WORLD, &flush_handle, &count); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to allocate handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } - MPIT_result = MPI_T_pvar_start(session, flush_handle); - if (MPIT_result != MPI_SUCCESS) { - printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); + MPIT_result = MPI_T_pvar_start(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } } if (rank == 0) { n = 25; - MPI_Isend(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD,&request); + MPI_Send(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD); } while (1) { - MPI_Irecv(&n,1,MPI_INT,from,tagno,MPI_COMM_WORLD, &request); - MPI_Wait(&request,&status); + MPI_Recv(&n, 1, MPI_INT, from, tagno, MPI_COMM_WORLD, MPI_STATUS_IGNORE); if (rank == 0) {n--;tagno++;} - MPI_Isend(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD, &request); + MPI_Send(&n, 1, MPI_INT, to, tagno, MPI_COMM_WORLD); if (rank != 0) {n--;tagno++;} if (n<0){ break; } } - /* Build one file per processes - Every thing that has been monitored by each - process since the last flush will be output in filename */ + if( with_mpit ) { + /* Build one file per processes + Every thing that has been monitored by each + process since the last flush will be output in filename */ + /* + Requires directory prof to be created. + Filename format should display the phase number + and the process rank for ease of parsing with + aggregate_profile.pl script + */ + sprintf(filename, "prof/phase_1"); - /* - Requires directory prof to be created. 
- Filename format should display the phase number - and the process rank for ease of parsing with - aggregate_profile.pl script - */ - sprintf(filename,"prof/phase_1_%d.prof",rank); - if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, filename) ) { - fprintf(stderr, "Process %d cannot save monitoring in %s\n", rank, filename); - } - /* Force the writing of the monitoring data */ - MPIT_result = MPI_T_pvar_stop(session, flush_handle); - if (MPIT_result != MPI_SUCCESS) { - printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); - } + if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, filename) ) { + fprintf(stderr, "Process %d cannot save monitoring in %s.%d.prof\n", + world_rank, filename, world_rank); + } + /* Force the writing of the monitoring data */ + MPIT_result = MPI_T_pvar_stop(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } - MPIT_result = MPI_T_pvar_start(session, flush_handle); - if (MPIT_result != MPI_SUCCESS) { - printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); - } - /* Don't set a filename. If we stop the session before setting it, then no output ile - * will be generated. - */ - if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, NULL) ) { - fprintf(stderr, "Process %d cannot save monitoring in %s\n", rank, filename); + MPIT_result = MPI_T_pvar_start(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + /* Don't set a filename. If we stop the session before setting it, then no output file + * will be generated. + */ + if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, (void*)&nullbuf) ) { + fprintf(stderr, "Process %d cannot save monitoring in %s\n", world_rank, filename); + } } /* Second phase. Work with different communicators. - even ranls will circulate a token - while odd ranks wil perform a all_to_all + even ranks will circulate a token + while odd ranks will perform a all_to_all */ MPI_Comm_split(MPI_COMM_WORLD, rank%2, rank, &newcomm); - /* the filename for flushing monitoring now uses 2 as phase number! 
*/ - sprintf(filename, "prof/phase_2_%d.prof", rank); - - if(rank%2){ /*even ranks (in COMM_WORD) circulate a token*/ + if(rank%2){ /*odd ranks (in COMM_WORD) circulate a token*/ MPI_Comm_rank(newcomm, &rank); MPI_Comm_size(newcomm, &size); if( size > 1 ) { - to = (rank + 1) % size;; - from = (rank - 1) % size ; + to = (rank + 1) % size; + from = (rank - 1) % size; tagno = 201; if (rank == 0){ n = 50; MPI_Send(&n, 1, MPI_INT, to, tagno, newcomm); } while (1){ - MPI_Recv(&n, 1, MPI_INT, from, tagno, newcomm, &status); + MPI_Recv(&n, 1, MPI_INT, from, tagno, newcomm, MPI_STATUS_IGNORE); if (rank == 0) {n--; tagno++;} MPI_Send(&n, 1, MPI_INT, to, tagno, newcomm); if (rank != 0) {n--; tagno++;} if (n<0){ - if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, filename) ) { - fprintf(stderr, "Process %d cannot save monitoring in %s\n", rank, filename); - } break; } } } - } else { /*odd ranks (in COMM_WORD) will perform a all_to_all and a barrier*/ + } else { /*even ranks (in COMM_WORD) will perform a all_to_all and a barrier*/ int send_buff[10240]; int recv_buff[10240]; + MPI_Comm newcomm2; MPI_Comm_rank(newcomm, &rank); MPI_Comm_size(newcomm, &size); MPI_Alltoall(send_buff, 10240/size, MPI_INT, recv_buff, 10240/size, MPI_INT, newcomm); - MPI_Comm_split(newcomm, rank%2, rank, &newcomm); - MPI_Barrier(newcomm); + MPI_Comm_split(newcomm, rank%2, rank, &newcomm2); + MPI_Barrier(newcomm2); + MPI_Comm_free(&newcomm2); + } + + if( with_mpit ) { + /* Build one file per processes + Every thing that has been monitored by each + process since the last flush will be output in filename */ + /* + Requires directory prof to be created. + Filename format should display the phase number + and the process rank for ease of parsing with + aggregate_profile.pl script + */ + sprintf(filename, "prof/phase_2"); + if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, filename) ) { - fprintf(stderr, "Process %d cannot save monitoring in %s\n", rank, filename); + fprintf(stderr, "Process %d cannot save monitoring in %s.%d.prof\n", + world_rank, filename, world_rank); } - } - MPIT_result = MPI_T_pvar_stop(session, flush_handle); - if (MPIT_result != MPI_SUCCESS) { - printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); - } + /* Force the writing of the monitoring data */ + MPIT_result = MPI_T_pvar_stop(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } - MPIT_result = MPI_T_pvar_handle_free(session, &flush_handle); - if (MPIT_result != MPI_SUCCESS) { - printf("failed to free handle on \"%s\" pvar, check that you have monitoring pml\n", - flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); + MPIT_result = MPI_T_pvar_start(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + /* Don't set a filename. If we stop the session before setting it, then no output + * will be generated. 
+ */ + if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, (void*)&nullbuf ) ) { + fprintf(stderr, "Process %d cannot save monitoring in %s\n", world_rank, filename); + } } - MPIT_result = MPI_T_pvar_session_free(&session); - if (MPIT_result != MPI_SUCCESS) { - printf("cannot close a session for \"%s\" pvar\n", flush_pvar_name); - MPI_Abort(MPI_COMM_WORLD, MPIT_result); + if( with_rma ) { + MPI_Win win; + int rs_buff[10240]; + int win_buff[10240]; + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Comm_size(MPI_COMM_WORLD, &size); + to = (rank + 1) % size; + from = (rank + size - 1) % size; + for( int v = 0; v < 10240; ++v ) + rs_buff[v] = win_buff[v] = rank; + + MPI_Win_create(win_buff, 10240 * sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win); + MPI_Win_fence(MPI_MODE_NOPRECEDE, win); + if( rank%2 ) { + MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOPUT, win); + MPI_Get(rs_buff, 10240, MPI_INT, from, 0, 10240, MPI_INT, win); + } else { + MPI_Put(rs_buff, 10240, MPI_INT, to, 0, 10240, MPI_INT, win); + MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOPUT, win); + } + MPI_Win_fence(MPI_MODE_NOSUCCEED, win); + + for( int v = 0; v < 10240; ++v ) + if( rs_buff[v] != win_buff[v] && ((rank%2 && rs_buff[v] != from) || (!(rank%2) && rs_buff[v] != rank)) ) { + printf("Error on checking exchanged values: %s_buff[%d] == %d instead of %d\n", + rank%2 ? "rs" : "win", v, rs_buff[v], rank%2 ? from : rank); + MPI_Abort(MPI_COMM_WORLD, -1); + } + + MPI_Group world_group, newcomm_group, distant_group; + MPI_Comm_group(MPI_COMM_WORLD, &world_group); + MPI_Comm_group(newcomm, &newcomm_group); + MPI_Group_difference(world_group, newcomm_group, &distant_group); + if( rank%2 ) { + MPI_Win_post(distant_group, 0, win); + MPI_Win_wait(win); + /* Check recieved values */ + for( int v = 0; v < 10240; ++v ) + if( from != win_buff[v] ) { + printf("Error on checking exchanged values: win_buff[%d] == %d instead of %d\n", + v, win_buff[v], from); + MPI_Abort(MPI_COMM_WORLD, -1); + } + } else { + MPI_Win_start(distant_group, 0, win); + MPI_Put(rs_buff, 10240, MPI_INT, to, 0, 10240, MPI_INT, win); + MPI_Win_complete(win); + } + MPI_Group_free(&world_group); + MPI_Group_free(&newcomm_group); + MPI_Group_free(&distant_group); + MPI_Barrier(MPI_COMM_WORLD); + + for( int v = 0; v < 10240; ++v ) rs_buff[v] = rank; + + MPI_Win_lock(MPI_LOCK_EXCLUSIVE, to, 0, win); + MPI_Put(rs_buff, 10240, MPI_INT, to, 0, 10240, MPI_INT, win); + MPI_Win_unlock(to, win); + + MPI_Barrier(MPI_COMM_WORLD); + + /* Check recieved values */ + for( int v = 0; v < 10240; ++v ) + if( from != win_buff[v] ) { + printf("Error on checking exchanged values: win_buff[%d] == %d instead of %d\n", + v, win_buff[v], from); + MPI_Abort(MPI_COMM_WORLD, -1); + } + + MPI_Win_free(&win); } - (void)PMPI_T_finalize(); + if( with_mpit ) { + /* the filename for flushing monitoring now uses 3 as phase number! 
*/ + sprintf(filename, "prof/phase_3"); + + if( MPI_SUCCESS != MPI_T_pvar_write(session, flush_handle, filename) ) { + fprintf(stderr, "Process %d cannot save monitoring in %s.%d.prof\n", + world_rank, filename, world_rank); + } + + MPIT_result = MPI_T_pvar_stop(session, flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_handle_free(session, &flush_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to free handle on \"%s\" pvar, check that you have monitoring pml\n", + flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_session_free(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot close a session for \"%s\" pvar\n", flush_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + (void)MPI_T_finalize(); + } + MPI_Comm_free(&newcomm); /* Now, in MPI_Finalize(), the pml_monitoring library outputs, in STDERR, the aggregated recorded monitoring of all the phases*/ MPI_Finalize(); diff --git a/test/monitoring/profile2mat.pl b/test/monitoring/profile2mat.pl index a6ea6a52bb4..69275a24ff5 100644 --- a/test/monitoring/profile2mat.pl +++ b/test/monitoring/profile2mat.pl @@ -4,7 +4,7 @@ # Copyright (c) 2013-2015 The University of Tennessee and The University # of Tennessee Research Foundation. All rights # reserved. -# Copyright (c) 2013-2015 Inria. All rights reserved. +# Copyright (c) 2013-2016 Inria. All rights reserved. # $COPYRIGHT$ # # Additional copyrights may follow @@ -35,9 +35,11 @@ $filename=$ARGV[0]; } -profile($filename,"I|E","all"); +profile($filename,"I|E|S|R|C","all"); if ( profile($filename,"E","external") ){ - profile($filename,"I","internal"); + profile($filename,"I","internal"); + profile($filename,"S|R","osc"); + profile($filename,"C","coll"); } sub profile{ diff --git a/test/monitoring/test_overhead.c b/test/monitoring/test_overhead.c new file mode 100644 index 00000000000..43717294bf9 --- /dev/null +++ b/test/monitoring/test_overhead.c @@ -0,0 +1,294 @@ +/* + * Copyright (c) 2016-2017 Inria. All rights reserved. + * Copyright (c) 2017 Research Organization for Information Science + * and Technology (RIST). All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + Measurement for thze pml_monitoring component overhead + + Designed by Clement Foyer + Contact the authors for questions. 
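+
+  For each of the six measured operations (MPI_Send, MPI_Send ping-pong,
+  MPI_Bcast, MPI_Alltoall, MPI_Put and MPI_Get), the benchmark times
+  NB_ITER iterations per message size, with sizes growing from 0 by a
+  factor of roughly 1.4 up to MAX_SIZE. The timings of all ranks are
+  gathered on rank 0, sorted, and reported as minimum latency, the
+  derived bandwidths, median, quartiles, first and ninth deciles,
+  average and maximum.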
+ + To be run as: + + +*/ + +#include +#include +#include +#include +#include +#include "mpi.h" + +#define NB_ITER 1000 +#define FULL_NB_ITER (size_world * NB_ITER) +#define MAX_SIZE (1024 * 1024 * 1.4) +#define NB_OPS 6 + +static int rank_world = -1; +static int size_world = 0; +static int to = -1; +static int from = -1; +static MPI_Win win = MPI_WIN_NULL; + +/* Sorting results */ +static int comp_double(const void*_a, const void*_b) +{ + const double*a = _a; + const double*b = _b; + if(*a < *b) + return -1; + else if(*a > *b) + return 1; + else + return 0; +} + +/* Timing */ +static inline void get_tick(struct timespec*t) +{ +#if defined(__bg__) +# define CLOCK_TYPE CLOCK_REALTIME +#elif defined(CLOCK_MONOTONIC_RAW) +# define CLOCK_TYPE CLOCK_MONOTONIC_RAW +#elif defined(CLOCK_MONOTONIC) +# define CLOCK_TYPE CLOCK_MONOTONIC +#endif +#if defined(CLOCK_TYPE) + clock_gettime(CLOCK_TYPE, t); +#else + struct timeval tv; + gettimeofday(&tv, NULL); + t->tv_sec = tv.tv_sec; + t->tv_nsec = tv.tv_usec * 1000; +#endif +} +static inline double timing_delay(const struct timespec*const t1, const struct timespec*const t2) +{ + const double delay = 1000000.0 * (t2->tv_sec - t1->tv_sec) + (t2->tv_nsec - t1->tv_nsec) / 1000.0; + return delay; +} + +/* Operations */ +static inline void op_send(double*res, void*sbuf, int size, int tagno, void*rbuf) { + MPI_Request request; + struct timespec start, end; + + /* Post to be sure no unexpected message will be generated */ + MPI_Irecv(rbuf, size, MPI_BYTE, from, tagno, MPI_COMM_WORLD, &request); + + /* Token ring to synchronize */ + /* We send to the sender to make him know we are ready to + receive (even for non-eager mode sending) */ + if( 0 == rank_world ) { + MPI_Send(NULL, 0, MPI_BYTE, from, 100, MPI_COMM_WORLD); + MPI_Recv(NULL, 0, MPI_BYTE, to, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + } else { + MPI_Recv(NULL, 0, MPI_BYTE, to, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + MPI_Send(NULL, 0, MPI_BYTE, from, 100, MPI_COMM_WORLD); + } + + /* do monitored operation */ + get_tick(&start); + MPI_Send(sbuf, size, MPI_BYTE, to, tagno, MPI_COMM_WORLD); + get_tick(&end); + + MPI_Wait(&request, MPI_STATUS_IGNORE); + *res = timing_delay(&start, &end); +} + +static inline void op_send_pingpong(double*res, void*sbuf, int size, int tagno, void*rbuf) { + struct timespec start, end; + + MPI_Barrier(MPI_COMM_WORLD); + + /* do monitored operation */ + if(rank_world % 2) { /* Odd ranks : Recv - Send */ + MPI_Recv(rbuf, size, MPI_BYTE, from, tagno, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + MPI_Send(sbuf, size, MPI_BYTE, from, tagno, MPI_COMM_WORLD); + MPI_Barrier(MPI_COMM_WORLD); + get_tick(&start); + MPI_Send(sbuf, size, MPI_BYTE, from, tagno, MPI_COMM_WORLD); + MPI_Recv(rbuf, size, MPI_BYTE, from, tagno, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + get_tick(&end); + } else { /* Even ranks : Send - Recv */ + get_tick(&start); + MPI_Send(sbuf, size, MPI_BYTE, to, tagno, MPI_COMM_WORLD); + MPI_Recv(rbuf, size, MPI_BYTE, to, tagno, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + get_tick(&end); + MPI_Barrier(MPI_COMM_WORLD); + MPI_Recv(rbuf, size, MPI_BYTE, to, tagno, MPI_COMM_WORLD, MPI_STATUS_IGNORE); + MPI_Send(sbuf, size, MPI_BYTE, to, tagno, MPI_COMM_WORLD); + } + + *res = timing_delay(&start, &end) / 2; +} + +static inline void op_coll(double*res, void*buff, int size, int tagno, void*rbuf) { + struct timespec start, end; + MPI_Barrier(MPI_COMM_WORLD); + + /* do monitored operation */ + get_tick(&start); + MPI_Bcast(buff, size, MPI_BYTE, 0, MPI_COMM_WORLD); + get_tick(&end); + + *res = 
timing_delay(&start, &end); +} + +static inline void op_a2a(double*res, void*sbuf, int size, int tagno, void*rbuf) { + struct timespec start, end; + MPI_Barrier(MPI_COMM_WORLD); + + /* do monitored operation */ + get_tick(&start); + MPI_Alltoall(sbuf, size, MPI_BYTE, rbuf, size, MPI_BYTE, MPI_COMM_WORLD); + get_tick(&end); + + *res = timing_delay(&start, &end); +} + +static inline void op_put(double*res, void*sbuf, int size, int tagno, void*rbuf) { + struct timespec start, end; + + MPI_Win_lock(MPI_LOCK_EXCLUSIVE, to, 0, win); + + /* do monitored operation */ + get_tick(&start); + MPI_Put(sbuf, size, MPI_BYTE, to, 0, size, MPI_BYTE, win); + MPI_Win_unlock(to, win); + get_tick(&end); + + *res = timing_delay(&start, &end); +} + +static inline void op_get(double*res, void*rbuf, int size, int tagno, void*sbuf) { + struct timespec start, end; + + MPI_Win_lock(MPI_LOCK_SHARED, to, 0, win); + + /* do monitored operation */ + get_tick(&start); + MPI_Get(rbuf, size, MPI_BYTE, to, 0, size, MPI_BYTE, win); + MPI_Win_unlock(to, win); + get_tick(&end); + + *res = timing_delay(&start, &end); +} + +static inline void do_bench(int size, char*sbuf, double*results, + void(*op)(double*, void*, int, int, void*)) { + int iter; + int tagno = 201; + char*rbuf = sbuf ? sbuf + size : NULL; + + if(op == op_put || op == op_get){ + win = MPI_WIN_NULL; + MPI_Win_create(rbuf, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win); + } + + for( iter = 0; iter < NB_ITER; ++iter ) { + op(&results[iter], sbuf, size, tagno, rbuf); + MPI_Barrier(MPI_COMM_WORLD); + } + + if(op == op_put || op == op_get){ + MPI_Win_free(&win); + win = MPI_WIN_NULL; + } +} + +int main(int argc, char* argv[]) +{ + int size, iter, nop; + char*sbuf = NULL; + double results[NB_ITER]; + void(*op)(double*, void*, int, int, void*); + char name[255]; + MPI_Init(&argc, &argv); + MPI_Comm_rank(MPI_COMM_WORLD, &rank_world); + MPI_Comm_size(MPI_COMM_WORLD, &size_world); + to = (rank_world + 1) % size_world; + from = (rank_world + size_world - 1) % size_world; + + double full_res[FULL_NB_ITER]; + + for( nop = 0; nop < NB_OPS; ++nop ) { + switch(nop) { + case 0: + op = op_send; + sprintf(name, "MPI_Send"); + break; + case 1: + op = op_coll; + sprintf(name, "MPI_Bcast"); + break; + case 2: + op = op_a2a; + sprintf(name, "MPI_Alltoall"); + break; + case 3: + op = op_put; + sprintf(name, "MPI_Put"); + break; + case 4: + op = op_get; + sprintf(name, "MPI_Get"); + break; + case 5: + op = op_send_pingpong; + sprintf(name, "MPI_Send_pp"); + break; + } + + if( 0 == rank_world ) + printf("# %s%%%d\n# size \t| latency \t| 10^6 B/s \t| MB/s \t| median \t| q1 \t| q3 \t| d1 \t| d9 \t| avg \t| max\n", name, size_world); + + for(size = 0; size < MAX_SIZE; size = ((int)(size * 1.4) > size) ? 
(size * 1.4) : (size + 1)) {
+            /* Init buffers */
+            if( 0 != size ) {
+                sbuf = (char *)realloc(sbuf, (size_world + 1) * size); /* sbuf + alltoall recv buf */
+            }
+
+            do_bench(size, sbuf, results, op);
+
+            MPI_Gather(results, NB_ITER, MPI_DOUBLE, full_res, NB_ITER, MPI_DOUBLE, 0, MPI_COMM_WORLD);
+
+            if( 0 == rank_world ) {
+                qsort(full_res, FULL_NB_ITER, sizeof(double), &comp_double);
+                const double min_lat = full_res[0];
+                const double max_lat = full_res[FULL_NB_ITER - 1];
+                const double med_lat = full_res[(FULL_NB_ITER - 1) / 2];
+                const double q1_lat  = full_res[(FULL_NB_ITER - 1) / 4];
+                const double q3_lat  = full_res[ 3 * (FULL_NB_ITER - 1) / 4];
+                const double d1_lat  = full_res[(FULL_NB_ITER - 1) / 10];
+                const double d9_lat  = full_res[ 9 * (FULL_NB_ITER - 1) / 10];
+                double avg_lat = 0.0;
+                for( iter = 0; iter < FULL_NB_ITER; iter++ ){
+                    avg_lat += full_res[iter];
+                }
+                avg_lat /= FULL_NB_ITER;
+                const double bw_million_byte = size / min_lat;
+                const double bw_mbyte = bw_million_byte / 1.048576;
+
+                printf("%9lld\t%9.3lf\t%9.3f\t%9.3f\t%9.3lf\t%9.3lf\t%9.3lf\t%9.3lf\t%9.3lf\t%9.3lf\t%9.3lf",
+                       (long long)size, min_lat, bw_million_byte, bw_mbyte,
+                       med_lat, q1_lat, q3_lat, d1_lat, d9_lat,
+                       avg_lat, max_lat);
+                printf("\n");
+            }
+        }
+        free(sbuf);
+        sbuf = NULL;
+    }
+
+    MPI_Finalize();
+    return EXIT_SUCCESS;
+}
diff --git a/test/monitoring/test_overhead.sh b/test/monitoring/test_overhead.sh
new file mode 100755
index 00000000000..3f263f1d6f8
--- /dev/null
+++ b/test/monitoring/test_overhead.sh
@@ -0,0 +1,216 @@
+#!/bin/bash
+
+#
+# Copyright (c) 2016-2017 Inria. All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+
+#
+# Author Clément Foyer
+#
+# This script launches the test_overhead test case for 2, 4, 8, 12,
+# 16, 20 and 24 processes, once with the monitoring component enabled,
+# and once without any monitoring. It then parses and aggregates the
+# results in order to create heatmaps. To work properly, this script
+# needs sqlite3, sed, awk and gnuplot. It also needs the rights to
+# write/create directories in the working path. Temporary files can be
+# found in $resdir/.tmp. They are cleaned between two executions of
+# this script.
+#
+# This script creates one heatmap per operation
+# tested.
Currently, tested operations are : +# - MPI_Send (software overhead) +# - MPI_Send (ping-pong, to measure theoverhead with the communciation time) +# - MPI_Bcast +# - MPI_Alltoall +# - MPI_Put +# - MPI_Get +# + +exe=test_overhead + +# add common options +if [ $# -ge 1 ] +then + mfile="-machinefile $1" +fi +common_opt="$mfile --bind-to core" + +# dir +resdir=res +tmpdir=$resdir/.tmp +# files +base_nomon=$resdir/unmonitored +base_mon=$resdir/monitored +dbfile=$tmpdir/base.db +dbscript=$tmpdir/overhead.sql +plotfile=$tmpdir/plot.gp +# operations +ops=(send a2a bcast put get sendpp) + +# no_monitoring(nb_nodes, exe_name, output_filename, error_filename) +function no_monitoring() { + mpiexec -n $1 $common_opt --mca pml ^monitoring --mca osc ^monitoring --mca coll ^monitoring $2 2> $4 > $3 +} + +# monitoring(nb_nodes, exe_name, output_filename, error_filename) +function monitoring() { + mpiexec -n $1 $common_opt --mca pml_monitoring_enable 1 --mca pml_monitoring_enable_output 3 --mca pml_monitoring_filename "prof/toto" $2 2> $4 > $3 +} + +# filter_output(filenames_list) +function filter_output() { + for filename in "$@" + do + # remove extra texts from the output + sed -i '/--------------------------------------------------------------------------/,/--------------------------------------------------------------------------/d' $filename + # create all sub files as $tmpdir/$filename + file=$(sed -e "s|$resdir/|$tmpdir/|" -e "s/\.dat/.csv/" <<< $filename) + # split in file, one per kind of operation monitored + awk "/^# MPI_Send/ {out=\"$(sed "s/\.$nbprocs/.send&/" <<< $file)\"}; \ + /^# MPI_Bcast/ {out=\"$(sed "s/\.$nbprocs/.bcast&/" <<< $file)\"}; \ + /^# MPI_Alltoall/ {out=\"$(sed "s/\.$nbprocs/.a2a&/" <<< $file)\"}; \ + /^# MPI_Put/ {out=\"$(sed "s/\.$nbprocs/.put&/" <<< $file)\"}; \ + /^# MPI_Get/ {out=\"$(sed "s/\.$nbprocs/.get&/" <<< $file)\"}; \ + /^# MPI_Send_pp/ {out=\"$(sed "s/\.$nbprocs/.sendpp&/" <<< $file)\"}; \ + /^#/ { } ; !/^#/ {\$0=\"$nbprocs \"\$0; print > out};" \ + out=$tmpdir/tmp $filename + done + # trim spaces and replace them with comma in each file generated with awk + for file in `ls $tmpdir/*.*.$nbprocs.csv` + do + sed -i 's/[[:space:]]\{1,\}/,/g' $file + done +} + +# clean previous execution if any +if [ -d $tmpdir ] +then + rm -fr $tmpdir +fi +mkdir -p $tmpdir + +# start creating the sql file for data post-processing +cat > $dbscript <> $dbscript + echo -e "create table if not exists ${op}_mon (nbprocs integer, datasize integer, lat float, speed float, MBspeed float, media float, q1 float, q3 float, d1 float, d9 float, average float, maximum float, primary key (nbprocs, datasize) on conflict abort);\ncreate table if not exists ${op}_nomon (nbprocs integer, datasize integer, lat float, speed float, MBspeed float, media float, q1 float, q3 float, d1 float, d9 float, average float, maximum float, primary key (nbprocs, datasize) on conflict abort);" >> $dbscript +done + +# main loop to launch benchmarks +for nbprocs in 2 4 8 12 16 20 24 +do + echo "$nbprocs procs..." 
+ output_nomon="$base_nomon.$nbprocs.dat" + error_nomon="$base_nomon.$nbprocs.err" + output_mon="$base_mon.$nbprocs.dat" + error_mon="$base_mon.$nbprocs.err" + # actually do the benchmarks + no_monitoring $nbprocs $exe $output_nomon $error_nomon + monitoring $nbprocs $exe $output_mon $error_mon + # prepare data to insert them more easily into database + filter_output $output_nomon $output_mon + # insert into database + echo -e "\n-- Import each CSV file in its corresponding table" >> $dbscript + for op in ${ops[*]} + do + echo -e ".import $(sed "s|$resdir/|$tmpdir/|" <<<$base_mon).${op}.${nbprocs}.csv ${op}_mon\n.import $(sed "s|$resdir/|$tmpdir/|" <<<$base_nomon).${op}.${nbprocs}.csv ${op}_nomon" >> $dbscript + done +done + +echo "Fetch data..." +echo -e "\n-- Perform some select query" >> $dbscript +for op in ${ops[*]} +do + cat >> $dbscript <> $dbscript <> $dbscript <> $dbscript < $plotfile < out ; print $0 > out } else { print $0 > out } }' out=$tmpdir/${op}.dat $tmpdir/${op}.dat + echo -e "set output '$resdir/${op}.png'\nsplot '$tmpdir/${op}.dat' using (\$1):(\$2):(\$3) with pm3d" +done) +EOF + +echo "Generating graphs..." + +gnuplot < $plotfile + +echo "Done." diff --git a/test/monitoring/test_pvar_access.c b/test/monitoring/test_pvar_access.c new file mode 100644 index 00000000000..3c0d5c04eb2 --- /dev/null +++ b/test/monitoring/test_pvar_access.c @@ -0,0 +1,323 @@ +/* + * Copyright (c) 2013-2017 The University of Tennessee and The University + * of Tennessee Research Foundation. All rights + * reserved. + * Copyright (c) 2013-2016 Inria. All rights reserved. + * Copyright (c) 2015 Cisco Systems, Inc. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* +pml monitoring tester. + +Designed by George Bosilca , Emmanuel Jeannot and +Clement Foyer +Contact the authors for questions. 
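+
+The test binds handles for the "pml_monitoring_messages_count" and
+"pml_monitoring_messages_size" performance variables to MPI_COMM_WORLD
+within an MPI_T pvar session, starts them, and reads them back after
+each communication phase: a token ring over MPI_COMM_WORLD first, then
+a mix of token ring and MPI_Alltoall over split communicators. Each
+rank prints its per-peer message counts and sizes, and a zero-byte
+token ring is used only to order the printed output.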
+ +To be run as: + +mpirun -np 4 --mca pml_monitoring_enable 2 ./test_pvar_access + +Then, the output should be: +Flushing phase 1: +I 0 1 108 bytes 27 msgs sent +I 1 2 104 bytes 26 msgs sent +I 2 3 104 bytes 26 msgs sent +I 3 0 104 bytes 26 msgs sent +Flushing phase 2: +I 0 1 20 bytes 4 msgs sent +I 0 2 20528 bytes 9 msgs sent +I 1 0 20 bytes 4 msgs sent +I 1 2 104 bytes 26 msgs sent +I 1 3 236 bytes 56 msgs sent +I 2 0 20528 bytes 9 msgs sent +I 2 3 112 bytes 27 msgs sent +I 3 1 220 bytes 52 msgs sent +I 3 2 20 bytes 4 msgs sent + +*/ + +#include +#include +#include + +static MPI_T_pvar_handle count_handle; +static MPI_T_pvar_handle msize_handle; +static const char count_pvar_name[] = "pml_monitoring_messages_count"; +static const char msize_pvar_name[] = "pml_monitoring_messages_size"; +static int count_pvar_idx, msize_pvar_idx; +static int world_rank, world_size; + +static void print_vars(int rank, int size, size_t* msg_count, size_t*msg_size) +{ + int i; + for(i = 0; i < size; ++i) { + if(0 != msg_size[i]) + printf("I\t%d\t%d\t%zu bytes\t%zu msgs sent\n", rank, i, msg_size[i], msg_count[i]); + } +} + +int main(int argc, char* argv[]) +{ + int rank, size, n, to, from, tagno, MPIT_result, provided, count; + MPI_T_pvar_session session; + MPI_Status status; + MPI_Comm newcomm; + MPI_Request request; + size_t*msg_count_p1, *msg_size_p1; + size_t*msg_count_p2, *msg_size_p2; + + /* first phase : make a token circulated in MPI_COMM_WORLD */ + n = -1; + MPI_Init(&argc, &argv); + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Comm_size(MPI_COMM_WORLD, &size); + world_size = size; + world_rank = rank; + to = (rank + 1) % size; + from = (rank - 1) % size; + tagno = 201; + + MPIT_result = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); + if (MPIT_result != MPI_SUCCESS) + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + + /* Retrieve the pvar indices */ + MPIT_result = MPI_T_pvar_get_index(count_pvar_name, MPI_T_PVAR_CLASS_SIZE, &count_pvar_idx); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot find monitoring MPI_T \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_get_index(msize_pvar_name, MPI_T_PVAR_CLASS_SIZE, &msize_pvar_idx); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot find monitoring MPI_T \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Get session for pvar binding */ + MPIT_result = MPI_T_pvar_session_create(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot create a session for \"%s\" and \"%s\" pvars\n", + count_pvar_name, msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Allocating a new PVAR in a session will reset the counters */ + MPIT_result = MPI_T_pvar_handle_alloc(session, count_pvar_idx, + MPI_COMM_WORLD, &count_handle, &count); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to allocate handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_handle_alloc(session, msize_pvar_idx, + MPI_COMM_WORLD, &msize_handle, &count); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to allocate handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Allocate arrays to retrieve results */ + msg_count_p1 = calloc(count * 4, sizeof(size_t)); + msg_size_p1 = &msg_count_p1[count]; + msg_count_p2 = 
&msg_count_p1[2*count]; + msg_size_p2 = &msg_count_p1[3*count]; + + /* Start pvar */ + MPIT_result = MPI_T_pvar_start(session, count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_start(session, msize_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + if (rank == 0) { + n = 25; + MPI_Isend(&n,1,MPI_INT,to,tagno,MPI_COMM_WORLD,&request); + } + while (1) { + MPI_Irecv(&n, 1, MPI_INT, from, tagno, MPI_COMM_WORLD, &request); + MPI_Wait(&request, &status); + if (rank == 0) {n--;tagno++;} + MPI_Isend(&n, 1, MPI_INT, to, tagno, MPI_COMM_WORLD, &request); + if (rank != 0) {n--;tagno++;} + if (n<0){ + break; + } + } + + /* Test stopping variable then get values */ + MPIT_result = MPI_T_pvar_stop(session, count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_stop(session, msize_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to stop handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_read(session, count_handle, msg_count_p1); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to fetch handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_read(session, msize_handle, msg_size_p1); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to fetch handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Circulate a token to proper display the results */ + if(0 == world_rank) { + printf("Flushing phase 1:\n"); + print_vars(world_rank, world_size, msg_count_p1, msg_size_p1); + MPI_Send(NULL, 0, MPI_BYTE, (world_rank + 1) % world_size, 300, MPI_COMM_WORLD); + MPI_Recv(NULL, 0, MPI_BYTE, (world_rank - 1) % world_size, 300, MPI_COMM_WORLD, &status); + } else { + MPI_Recv(NULL, 0, MPI_BYTE, (world_rank - 1) % world_size, 300, MPI_COMM_WORLD, &status); + print_vars(world_rank, world_size, msg_count_p1, msg_size_p1); + MPI_Send(NULL, 0, MPI_BYTE, (world_rank + 1) % world_size, 300, MPI_COMM_WORLD); + } + + /* Add to the phase 1 the display token ring message count */ + MPIT_result = MPI_T_pvar_read(session, count_handle, msg_count_p1); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to fetch handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_read(session, msize_handle, msg_size_p1); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to fetch handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* + Second phase. Work with different communicators. 
+ even ranks will circulate a token + while odd ranks will perform a all_to_all + */ + MPIT_result = MPI_T_pvar_start(session, count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_start(session, msize_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to start handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPI_Comm_split(MPI_COMM_WORLD, rank%2, rank, &newcomm); + + if(rank%2){ /*even ranks (in COMM_WORD) circulate a token*/ + MPI_Comm_rank(newcomm, &rank); + MPI_Comm_size(newcomm, &size); + if( size > 1 ) { + to = (rank + 1) % size; + from = (rank - 1) % size; + tagno = 201; + if (rank == 0){ + n = 50; + MPI_Send(&n, 1, MPI_INT, to, tagno, newcomm); + } + while (1){ + MPI_Recv(&n, 1, MPI_INT, from, tagno, newcomm, &status); + if (rank == 0) {n--; tagno++;} + MPI_Send(&n, 1, MPI_INT, to, tagno, newcomm); + if (rank != 0) {n--; tagno++;} + if (n<0){ + break; + } + } + } + } else { /*odd ranks (in COMM_WORD) will perform a all_to_all and a barrier*/ + int send_buff[10240]; + int recv_buff[10240]; + MPI_Comm_rank(newcomm, &rank); + MPI_Comm_size(newcomm, &size); + MPI_Alltoall(send_buff, 10240/size, MPI_INT, recv_buff, 10240/size, MPI_INT, newcomm); + MPI_Comm_split(newcomm, rank%2, rank, &newcomm); + MPI_Barrier(newcomm); + } + + MPIT_result = MPI_T_pvar_read(session, count_handle, msg_count_p2); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to fetch handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_read(session, msize_handle, msg_size_p2); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to fetch handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + /* Taking only in account the second phase */ + for(int i = 0; i < size; ++i) { + msg_count_p2[i] -= msg_count_p1[i]; + msg_size_p2[i] -= msg_size_p1[i]; + } + + /* Circulate a token to proper display the results */ + if(0 == world_rank) { + printf("Flushing phase 2:\n"); + print_vars(world_rank, world_size, msg_count_p2, msg_size_p2); + MPI_Send(NULL, 0, MPI_BYTE, (world_rank + 1) % world_size, 300, MPI_COMM_WORLD); + MPI_Recv(NULL, 0, MPI_BYTE, (world_rank - 1) % world_size, 300, MPI_COMM_WORLD, &status); + } else { + MPI_Recv(NULL, 0, MPI_BYTE, (world_rank - 1) % world_size, 300, MPI_COMM_WORLD, &status); + print_vars(world_rank, world_size, msg_count_p2, msg_size_p2); + MPI_Send(NULL, 0, MPI_BYTE, (world_rank + 1) % world_size, 300, MPI_COMM_WORLD); + } + + MPIT_result = MPI_T_pvar_handle_free(session, &count_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to free handle on \"%s\" pvar, check that you have monitoring pml\n", + count_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + MPIT_result = MPI_T_pvar_handle_free(session, &msize_handle); + if (MPIT_result != MPI_SUCCESS) { + printf("failed to free handle on \"%s\" pvar, check that you have monitoring pml\n", + msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + } + + MPIT_result = MPI_T_pvar_session_free(&session); + if (MPIT_result != MPI_SUCCESS) { + printf("cannot close a session for \"%s\" and \"%s\" pvars\n", + count_pvar_name, msize_pvar_name); + MPI_Abort(MPI_COMM_WORLD, MPIT_result); + 
} + + (void)MPI_T_finalize(); + + free(msg_count_p1); + + MPI_Finalize(); + return EXIT_SUCCESS; +}
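
Usage note: a minimal sketch of how the new tests might be launched from
test/monitoring once built (the process count and machinefile name are
illustrative; the mpirun line is the one quoted in test_pvar_access.c, and
test_overhead.sh needs sqlite3, sed, awk and gnuplot):

    mkdir -p prof                                     # several tests write their dumps under prof/
    mpirun -np 4 --mca pml_monitoring_enable 2 ./test_pvar_access
    ./test_overhead.sh my_machinefile                 # the machinefile argument is optional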