Topic/test tomns ci #8

Open
wants to merge 68 commits into
base: main
from

Conversation

hppritcha

No description provided.

hppritcha and others added 30 commits August 4, 2020 11:27
The problem was that a --with-hwloc option pointing at a non-default location for an hwloc install
can add hwloc lib info to the global LDFLAGS, which causes
some of the runtime checks in the pthread configury to fail.  This commit resolves the problem
by using pthreads-only LDFLAGS for building the runtime checks in the pthreads configury.

related to open-mpi#7644

Signed-off-by: Howard Pritchard <[email protected]>
…erhead

- only make MCA parameters available if SPC is enabled

- do not compile SPC code if SPC is disabled

- move includes into ompi_spc.c

- allow counters to be enabled through MPI_T without setting MCA parameter

- inline counter update calls that are likely in the critical path

- fix test to succeed even if encountering invalid pvars

- move timer_[start|stop] to header and move attachment info into ompi_spc_t

There is no need to store the name in the ompi_spc_t struct too, we can use that space
for the attachment info instead to avoid accessing another cache line.

- make timer/watermark flags a property of the spc description

This is meant to make adding counters easier in the future by
centralizing the necessary information. Storing a copy of these flags
in the ompi_spc_t structure (without adding to its size) reduces
cache pollution for timer/watermark events.

- allocate ompi_spc_t objects with cache-alignment

This prevents objects from spanning multiple cache lines and thus
ensures that only one cache line is loaded per update.

- fix handling of timer and timer conversion

- only call opal_timer_base_get_cycles if necessary to reduce overhead

- Remove use of OPAL_UNLIKELY to improve code generated by GCC

It appears that GCC makes less effort in optimizing the unlikely path
and generates bloated code.

- Allocate ompi_spc_events statically to reduce loads in critical path

- duplicate comm_world only when dumping is requested

Signed-off-by: Joseph Schuchart <[email protected]>
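
The cache-alignment point above can be illustrated with a minimal sketch. It is not the ompi_spc_t code: `counter_t`, `counters_alloc`, and the 64-byte `CACHE_LINE_SIZE` are assumptions made for illustration, and C11 `aligned_alloc` stands in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed cache line size; real code would query or configure this. */
#define CACHE_LINE_SIZE 64

/* Hypothetical counter record, padded so that one record fills one line. */
typedef struct {
    uint64_t value;
    uint32_t flags; /* e.g. timer/watermark bits kept next to the value */
    char pad[CACHE_LINE_SIZE - sizeof(uint64_t) - sizeof(uint32_t)];
} counter_t;

/* Allocate n zeroed counters starting on a cache-line boundary.
 * Returns NULL on failure; release the result with free(). */
counter_t *counters_alloc(size_t n)
{
    assert(sizeof(counter_t) == CACHE_LINE_SIZE);
    if (n == 0) {
        return NULL;
    }
    /* aligned_alloc requires the size to be a multiple of the alignment,
     * which holds because each record is exactly one cache line. */
    counter_t *mem = aligned_alloc(CACHE_LINE_SIZE, n * sizeof(counter_t));
    if (mem != NULL) {
        memset(mem, 0, n * sizeof(counter_t));
    }
    return mem;
}
```

Because every record starts on a line boundary and is exactly one line long, updating one counter never touches a second cache line.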
A path that was being used in oversubscribed cases caused a help message to
be printed for each process. This replaces the help message with a debug output to
prevent excessive output unless the user enables debug output.

Signed-off-by: Nikola Dancejic <[email protected]>
The dynamic_gen_file_write_all component distinguishes between the amount of data communicated
to aggregators and the amount of data written in a cycle by the aggregator (in contrast to, e.g., the vulcan component).
There was a bug in calculating which chunks have to be written in a cycle by an aggregator: we added elements to the
io_array until we filled one stripe. Unfortunately, the metric used was the amount of data, rather than a check that all offsets
fall within a single stripe. This commit fixes this issue. Note that the bug did not create a correctness problem, just a performance
problem in case there were gaps in the file view.

Signed-off-by: Edgar Gabriel <[email protected]>
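
A minimal sketch of the offset-based check described above, under stated assumptions: `io_entry_t` and `entries_in_first_stripe` are hypothetical names, the entries are assumed sorted by offset, and splitting a chunk that crosses a stripe boundary is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one io_array entry. */
typedef struct {
    uint64_t offset; /* file offset of the chunk */
    uint64_t length; /* number of bytes in the chunk */
} io_entry_t;

/* Count the leading entries whose starting offset lies in the stripe that
 * contains the first entry. The decision uses file offsets, not the number
 * of bytes accumulated so far, so gaps in the file view are respected. */
size_t entries_in_first_stripe(const io_entry_t *entries, size_t n,
                               uint64_t stripe_size)
{
    assert(stripe_size > 0);
    if (n == 0) {
        return 0;
    }
    /* First byte past the stripe that holds entries[0]. */
    uint64_t stripe_end = (entries[0].offset / stripe_size + 1) * stripe_size;
    size_t count = 0;
    while (count < n && entries[count].offset < stripe_end) {
        count++;
    }
    return count;
}
```

With a 1024-byte stripe and chunks at offsets 0, 200, and 1100 of 100 bytes each, a byte-count metric would accept all three (300 bytes is less than one stripe), while the offset check stops after the second.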
The lack of data sieving has been identified as a main reason for the poor performance in some instances on the Lustre file system. This commit introduces the fundamental ability to perform data sieving for read operations (which should not be controversial). The code itself is correct; what is still lacking is a) the logic for when and how to activate data sieving and b) the logic to limit the size of the temporary buffer when doing data sieving.

Signed-off-by: Edgar Gabriel <[email protected]>
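
For readers unfamiliar with the technique, data sieving for reads can be sketched roughly as follows. This is an illustration using plain POSIX `pread`, not the ompio code: `segment_t` and `sieve_read` are hypothetical names, segments are assumed sorted and non-overlapping, and (like the commit) the sketch does not limit the size of the temporary buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one requested piece of the file. */
typedef struct {
    off_t offset;
    size_t length;
} segment_t;

/* Read all segments with one covering read, then copy the requested pieces
 * back to back into out. Returns the bytes copied, or -1 on error. */
ssize_t sieve_read(int fd, const segment_t *segs, size_t nsegs, char *out)
{
    if (nsegs == 0) {
        return 0;
    }
    off_t start = segs[0].offset;
    off_t end = segs[nsegs - 1].offset + (off_t)segs[nsegs - 1].length;
    assert(end >= start);
    size_t extent = (size_t)(end - start);
    if (extent == 0) {
        return 0;
    }
    char *tmp = malloc(extent);
    if (tmp == NULL) {
        return -1;
    }
    /* One large contiguous read that also fetches the gaps. */
    size_t got = 0;
    while (got < extent) {
        ssize_t rc = pread(fd, tmp + got, extent - got, start + (off_t)got);
        if (rc <= 0) {
            free(tmp);
            return -1;
        }
        got += (size_t)rc;
    }
    /* Sieve: keep only the bytes the caller asked for. */
    size_t copied = 0;
    for (size_t i = 0; i < nsegs; i++) {
        memcpy(out + copied, tmp + (size_t)(segs[i].offset - start),
               segs[i].length);
        copied += segs[i].length;
    }
    free(tmp);
    return (ssize_t)copied;
}
```

The trade-off is one large request plus some wasted bytes against many small requests, which is why the activation logic and the buffer-size limit mentioned above matter.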
only implemented for read at the moment, but the parameters
for write are also in place.

Signed-off-by: Edgar Gabriel <[email protected]>
when using data sieving.

Signed-off-by: Edgar Gabriel <[email protected]>
it is, however, restricted to collective I/O operations, at this point
only from vulcan and dynamic_gen2. This required some more infrastructure
to be added to recognize individual I/O and multi-threaded environments.

Signed-off-by: Edgar Gabriel <[email protected]>
Switch PMIx to v4.0 branch
Update PRRTE to current master

Signed-off-by: Ralph Castain <[email protected]>
remove now unused mca parameter, get rid of an unnecessary if-else part,
and move setting the flag outside of the while loop.

Signed-off-by: Edgar Gabriel <[email protected]>
ompio is now the default on Lustre as well

Signed-off-by: Edgar Gabriel <[email protected]>
Update PRRTE pointer to include ULFM fixes
Currently, mca_btl_ofi_put (get, aop, afop, acswp) will allocate
a mca_btl_ofi_rdma_completion_t object and use it as the context
for fi_write/fi_read/fi_atomic/fi_fetch_atomic/fi_compare_atomic.

In the normal code path, this completion object is released when the
completion entry is processed. However, when an error occurs while calling

fi_write/fi_read/fi_atomic/fi_fetch_atomic/fi_compare_atomic,

there will be no completion entry from libfabric, and in this case the
completion object's memory is leaked.

This patch addresses the issue by calling opal_free_list_return() in
the error handling code path.

Signed-off-by: Wei Zhang <[email protected]>
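
The shape of the fix can be sketched with a toy pool. Everything here is a hypothetical stand-in: `pool_t`, `pool_get`, `pool_return`, `rdma_put`, and the `post_ok`/`post_fail` stubs are not the OPAL free list or libfabric APIs, they only model "get a context, post, give it back if posting fails".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical completion object and a minimal free-list pool. */
typedef struct completion {
    struct completion *next;
} completion_t;

typedef struct {
    completion_t *head;
    size_t available;
} pool_t;

completion_t *pool_get(pool_t *p)
{
    completion_t *c = p->head;
    if (c != NULL) {
        p->head = c->next;
        p->available--;
    }
    return c;
}

void pool_return(pool_t *p, completion_t *c)
{
    c->next = p->head;
    p->head = c;
    p->available++;
}

/* Stand-in for a posting call: 0 on success, negative on error. */
typedef int (*post_fn)(completion_t *context);

int post_ok(completion_t *context) { (void)context; return 0; }
int post_fail(completion_t *context) { (void)context; return -1; }

int rdma_put(pool_t *pool, post_fn post)
{
    assert(pool != NULL && post != NULL);
    completion_t *comp = pool_get(pool);
    if (comp == NULL) {
        return -1;
    }
    int rc = post(comp);
    if (rc != 0) {
        /* No completion entry will ever arrive for a failed post, so the
         * completion handler cannot release comp; do it here instead. */
        pool_return(pool, comp);
        return rc;
    }
    /* On success the completion handler owns comp and returns it later. */
    return 0;
}
```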
btl/ofi: fix memory leaks in error handling path
common/ofi: fixing error message to be a debug output
Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
 * PGI was throwing the following error.
```
NVC++-S-0103-Illegal operand types for comparison operator (osc_rdma_frag.h: 75)
NVC++/power Linux 20.11-0: compilation completed with severe errors
```
 * It must not have liked the inline declaration of the NULL pointer.
   - So replace with a variable, as we do in other places in the code base.

Signed-off-by: Joshua Hursey <[email protected]>
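
As an illustration of the pattern (not the osc_rdma_frag.h code itself), the sketch below uses C11 atomics and passes the expected value of a compare-exchange through a named variable rather than an inline NULL expression. `frag_t` and `claim_slot` are hypothetical names, and the exact form of the original inline declaration is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fragment type. */
typedef struct frag {
    int id;
} frag_t;

/* Install new_frag only if the slot is currently empty. */
bool claim_slot(_Atomic(frag_t *) *slot, frag_t *new_frag)
{
    assert(new_frag != NULL);
    /* A named variable holds the expected value; some compilers reject or
     * mis-type an inline form such as &(frag_t *){NULL}. */
    frag_t *expected = NULL;
    return atomic_compare_exchange_strong(slot, &expected, new_frag);
}
```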
Signed-off-by: Zach Osman <[email protected]>
A typo (missing $) prevented --with-pmix=internal from working as expected

Thanks Zach Osman for reporting this.

Refs open-mpi#8326

Signed-off-by: Gilles Gouaillardet <[email protected]>
Removing config/opal_check_pmi.m4
Fix PGI compiler error with compare arg
jsquyres and others added 30 commits January 7, 2021 08:53
A started generalized request should be marked as pending.
Remove some left-over infrastructure for handling callbacks into the
MPI C++ bindings (which were removed long ago -- this code is now
stale).

Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Aboorva Devarajan <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
…ion-pointers

ompi/op: remove C++ function pointers
…ndler

ompi/errhandler: fix comm errhandler issue
 * `--mca ompi_display_comm VALUE` where `VALUE` is one or more of:
   - `mpi_init` : Display during `MPI_Init`
   - `mpi_finalize` : Display during `MPI_Finalize`
 * hook/comm_method: Use enum flags to select protocols

Signed-off-by: Joshua Hursey <[email protected]>
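
A rough sketch of how such a value could map onto enum flags. It does not use the MCA variable system: `parse_display_flags` and the `DISPLAY_AT_*` names are hypothetical, and the comma as separator for multiple values is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits, one per accepted value. */
enum {
    DISPLAY_AT_INIT     = 1 << 0, /* "mpi_init" */
    DISPLAY_AT_FINALIZE = 1 << 1  /* "mpi_finalize" */
};

/* Parse a comma-separated list of values into a flag mask.
 * Returns the mask, or -1 if an unknown value is found. */
int parse_display_flags(const char *value)
{
    assert(value != NULL);
    int flags = 0;
    const char *p = value;
    while (*p != '\0') {
        size_t len = strcspn(p, ","); /* length of the current token */
        if (len == strlen("mpi_init") && strncmp(p, "mpi_init", len) == 0) {
            flags |= DISPLAY_AT_INIT;
        } else if (len == strlen("mpi_finalize") &&
                   strncmp(p, "mpi_finalize", len) == 0) {
            flags |= DISPLAY_AT_FINALIZE;
        } else if (len != 0) {
            return -1; /* unknown token */
        }
        p += len;
        if (*p == ',') {
            p++; /* skip the separator */
        }
    }
    return flags;
}
```

Callers can then test individual bits, for example `flags & DISPLAY_AT_FINALIZE`, at the corresponding point in the program.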
Signed-off-by: Ralph Castain <[email protected]>
Allow fallback to a lesser AVX support during make
Update hook component to use enum MCA parameter
This commit fixes a bug discovered while debugging issue open-mpi#8350. Running our testsuite on macOS revealed that posting a large number of non-blocking read/write operations leads to an error message on this platform. A fix is already available and will be committed shortly.

The issue stems from a limit on macOS on the number of aio_read/aio_write operations that can be pending concurrently. While the code already handled that correctly for a single request, this bug exposed that the overall limit has to be respected across all pending requests.

The solution is to invoke mca_common_ompio_progress if we cannot post new aio operations.

Fixes issue open-mpi#8368

Signed-off-by: Edgar Gabriel <[email protected]>
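
The control flow of the solution can be modeled with a small simulation. This is not the fbtl/posix code and makes no real aio calls: `aio_state_t`, `try_post`, `progress`, and `post_all` are hypothetical stand-ins for "post an operation", "progress pending operations", and a cap shared by all requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a cap on pending async operations that is shared
 * by all outstanding requests. */
typedef struct {
    size_t limit;     /* max operations that may be pending at once */
    size_t pending;   /* operations posted but not yet completed */
    size_t completed; /* operations finished so far */
} aio_state_t;

/* Stand-in for posting one operation: fails when the cap is reached. */
int try_post(aio_state_t *s)
{
    if (s->pending >= s->limit) {
        return -1;
    }
    s->pending++;
    return 0;
}

/* Stand-in for a progress call: here it completes everything pending. */
void progress(aio_state_t *s)
{
    s->completed += s->pending;
    s->pending = 0;
}

/* Post total operations; when posting fails, progress the pending ones
 * (possibly belonging to other requests) instead of reporting an error. */
void post_all(aio_state_t *s, size_t total)
{
    assert(s->limit > 0);
    size_t posted = 0;
    while (posted < total) {
        if (try_post(s) == 0) {
            posted++;
        } else {
            progress(s);
        }
    }
}
```

Starting with operations already pending from another request, `post_all` still gets every new operation posted without ever exceeding the cap.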
icc does not define the __AVX*__ macros if the corresponding -m architecture
flag was not provided. Thus, make sure we always provide it for icc (but not
necessarily for gcc).

Signed-off-by: George Bosilca <[email protected]>
fbtl/posix: ensure progressing aio requests
Enable AVX support with Intel compilers
Yes, it was done by hand and is therefore fragile - but somebody had to do something as people think we dropped support for various runtime-based things.

Signed-off-by: Ralph Castain <[email protected]>
Expose the PMIx/PRRTE unique configure args
Update both PMIx and PRRTE. Ensure the MPI proc properly notifies the daemon when it is pausing for debugger attach

Signed-off-by: Ralph Castain <[email protected]>
Switch back to PMIx master branch
Signed-off-by: wiltonloch <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
Thanks to Andreas Lösel for bringing the outdated docs to our
attention.

Signed-off-by: Jeff Squyres <[email protected]>
…-man-page-updates

MPI_Init_thread(3): update refs about MPI_THREAD_MULTIPLE
…-fix

Fix error with stricter quoting requirements of autoconf-2.70
Signed-off-by: Howard Pritchard <[email protected]>