Version 1.31: Sparse CTF + CUDA support + performance model improvements
New features/updates:
- Renewed support for CUDA-based offloading
- New macro -DTUNE activates model tuning, which adjusts the performance model parameters. A benchmark that executes a characteristic training set is included in bench/model_trainer and can be used to tune the models for any architecture; the output printed at the end of the benchmark should be pasted into src/shared/init_models.cxx (an illustrative sketch of such a tuning workload is given after this list). It is not advisable to run with -DTUNE enabled at all times. The current parameters are based on 16-node runs on Edison with 4 processes per node and should be reasonable in most settings. A cubic polynomial model is also available, but is not used due to inferior observed modelling quality.
- Added a mapping search that exhaustively tries all possible mappings. Since no measurable benefit was observed from this expanded search space and the search itself slowed down low-cost contractions (e.g. in CCSDT), the exhaustive search is only performed when the time estimated by the old scheme exceeds 1 second (the threshold rule is sketched after this list).
- Fixed up the configure script and the generated config.mk file to make them more effective and better documented.
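
The tuning workflow can be illustrated with a minimal sketch. This is not the actual bench/model_trainer benchmark; it only shows the kind of characteristic workload one might run in a build compiled with -DTUNE, so that the performance models observe a mix of contraction sizes. It assumes CTF's public C++ contraction interface; fill_random is an assumption that may not be present in every version (filling via read_local/write works as well).

    #include <ctf.hpp>
    using namespace CTF;

    int main(int argc, char ** argv){
      MPI_Init(&argc, &argv);
      {
        World dw(argc, argv);
        // Run contractions of varying size so that a -DTUNE build collects
        // timing samples spanning small and large problem sizes.
        for (int n = 100; n <= 400; n *= 2){
          Matrix<> A(n, n, NS, dw);
          Matrix<> B(n, n, NS, dw);
          Matrix<> C(n, n, NS, dw);
          A.fill_random(0.0, 1.0);  // assumption: fill_random is available
          B.fill_random(0.0, 1.0);
          C["ij"] += A["ik"] * B["kj"];  // each contraction contributes samples
        }
        // With -DTUNE defined, tuned model parameters are reported at the end
        // of the run; per the note above, that output is what gets pasted into
        // src/shared/init_models.cxx to become the new defaults.
      }
      MPI_Finalize();
      return 0;
    }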
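
The 1-second threshold for the exhaustive mapping search amounts to the following decision rule. Every name in this sketch is hypothetical and stands in for CTF's internal mapping logic; it only illustrates when the extra search is worth paying for.

    #include <cstdio>

    // Hypothetical stand-in for the time the existing performance model
    // predicts for a contraction under the old mapping scheme.
    double predicted_seconds_old_scheme(double flops){ return flops * 1e-10; }

    // Only pay for exhaustively enumerating mappings when the contraction is
    // predicted to be expensive enough to amortize the search overhead.
    void choose_mapping_strategy(double flops){
      if (predicted_seconds_old_scheme(flops) > 1.0)
        std::printf("expensive contraction: exhaustively try all mappings\n");
      else
        std::printf("cheap contraction (e.g. small CCSDT term): keep old search\n");
    }

    int main(){
      choose_mapping_strategy(1.e6);   // predicted well under 1 s
      choose_mapping_strategy(1.e12);  // predicted over 1 s, worth the extra search
      return 0;
    }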