diff --git a/RFC0026.md b/RFC0026.md new file mode 100644 index 0000000..ada4ca6 --- /dev/null +++ b/RFC0026.md @@ -0,0 +1,277 @@ +# RFC0026 + +## Title +Power Control for Applications and Tools + +## Abstract +Power control of high-performance computing clusters is of growing interest among large-scale systems and power-constrained installations. A preliminary API [specification](http://powerapi.sandia.gov/) has been published that provides atomistic control and measurement functions. This RFC provides an abstraction layer that also simplifies the interfaces. + +## Labels +[EXTENSION][ATTRIBUTES] + +## Action +[IN PROGRESS] + +## Copyright Notice +Copyright 2017 Intel, Inc. All rights reserved. + +This document is subject to all provisions relating to code contributions to the PMIx community as defined in the community's [LICENSE](https://github.com/pmix/RFCs/tree/master/LICENSE) file. Code Components extracted from this document must include the License text as described in that file. + +## Description +Power control of high-performance computing clusters is of growing interest among large-scale systems and power-constrained installations. The precise nature of this control - e.g., setting of overall power envelope for a job by the scheduler vs. application-level control of individual core frequency settings - remains a subject of research. The goal of this RFC is not to replicate the work of others focused on providing the low-level controls, but rather to provide abstraction interfaces that (a) simplify the user interaction, and (b) allow for evolving specifications and alternative approaches without forcing changes into the resource managers and user applications. + +The initial level of support covers three broad areas: + +* request allocation of a given power level for an application at time of job submission - this would specify the desired aggregate power allocation for the application, integrated across all allocated resources/nodes. This is an intentionally vague window as it doesn't specify whether or not it includes, for example, the power used by the parallel file system while supporting file operations for this application. Supported levels will initially be defined both as a broad category (e.g., "low", "high") and specific value. + +* query current power level settings and current actual power usage. Support for both cpu and network resources will be included, subject to the PMIx "not supported" rule + +* set/modify the overall power level for a running application. The initial proposal is to support this at the total application level vs specifying the frequency or power of an individual cpu. + +The query of current power settings and power usage shall be supported via the existing PMIx\_Query\_info\_nb interface via definition of appropriate attributes. Both power-specific attributes and associated physical factors such as cpu frequency and governors shall be covered. The following attributes and their returned value are included for this purpose: +``` +#define PMIX_QUERY_POWER_CAP "pmix.query.pcap" // (double) current power cap for the target resource +#define PMIX_QUERY_POWER_SETTING "pmix.query.psetting" // (double) current power setting for the target resource +#define PMIX_QUERY_POWER_USAGE "pmix.query.puse" // (double) current power level of the target resource +#define PMIX_QUERY_FREQ_SETTING "pmix.query.freq" // (double) current frequency setting (GHz) of the target resource +#define PMIX_QUERY_CPU_FREQ "pmix.query.cfreq" // (double) current operating frequency (GHz) of the target cpu +#define PMIX_QUERY_CPU_GOVERNOR "pmix.query.cgov" // (char*) current governor covering the target cpu +#define PMIX_QUERY_AVAIL_GOVERNORS "pmix.query.govs" // (char*) comma-delimited list of available governors + // on the specified nodes/cpus +``` +The query shall also allow specification of target resources (e.g., the node or cpu whose power usage is being requested) using the _qualifiers_ array of the pmix_query_t structure, as illustrated in the example provided below. Specifying the target node for these queries can be done via the PMIX\_HOSTNAME for a particular node, or using the PMIX\_NODE\_MAP attribute combined with a supported regular expression (e.g., one generated via the PMIx\_generate\_regex function). Specifying the target cpu(s) can be done using the PMIX\_CPU\_LIST attribute with a string value for the logical cpu number. + +Note that single requests spanning multiple cpus will return a pmix\_info\_t containing a pmix\_data\_array\_t value, with each element of the array containing the requested info for the cpu in the corresponding position in the request. For example, a request for the frequency of cpus "1,3,5" will return a pmix\_data\_array\_t value comprised of three pmix\_info\_t structures containing (in order) the frequency of cpu 1, cpu 3, and cpu 5. + +Similarly, setting the power level and associated factors for a running application shall be supported via the existing PMIx\_Job\_control\_nb interface through the definition of the following attributes: +``` +/* define some general power categories */ +#define PMIX_POWER_HIGH "pmix.pwr.high" +#define PMIX_POWER_LOW "pmix.pwr.low" + +#define PMIX_JOB_CONTROL_SET_POWER_CAP "pmix/jctrl.spcap" // (double) set power cap for target resource +#define PMIX_JOB_CONTROL_SET_POWER_LEVEL "pmix.jctrl.splvl" // (char*) general power level for target resource +#define PMIX_JOB_CONTROL_SET_POWER "pmix.jctrl.spwr" // (double) set power of target resource to specified level +#define PMIX_JOB_CONTROL_SET_FREQ "pmix.jctrl.sfreq" // (double) set frequency (GHz) of target resource to specified value +#define PMIX_JOB_CONTROL_SET_CPU_GOV "pmix.jctrl.cgov" // (char*) set governor of target cpu to specified version +``` +Power allocation requests, once granted, can be implemented by the host RM via the job control interface. Specification of a standard script-level syntax for power allocation requests is currently beyond the scope of the PMIx Standard and is left to the scheduler provider. + +## Security +The host RM is responsible for obtaining and enforcing authorizations for a given application/user relative to any specific request. The effective identity of the user and group are provided to the host RM by the PMIx server in the qualifiers of the respective query and job control function calls. Those desiring stronger credentials can obtain and validate them via the PMIx security APIs. + +## Example +The specification of target resources for a query involves manipulation of the pmix\_query\_t object. The following example requests the current power usage and available governors for several cpus on a given node: +``` +#include + +typedef struct { + pthread_mutex_t mutex; + pthread_cond_t cond; + volatile bool active; + pmix_status_t status; +} mylock_t; + +#define DEBUG_CONSTRUCT_LOCK(l) \ + do { \ + pthread_mutex_init(&(l)->mutex, NULL); \ + pthread_cond_init(&(l)->cond, NULL); \ + (l)->active = true; \ + (l)->status = PMIX_SUCCESS; \ + } while(0) + +#define DEBUG_DESTRUCT_LOCK(l) \ + do { \ + pthread_mutex_destroy(&(l)->mutex); \ + pthread_cond_destroy(&(l)->cond); \ + } while(0) + +#define DEBUG_WAIT_THREAD(lck) \ + do { \ + pthread_mutex_lock(&(lck)->mutex); \ + while ((lck)->active) { \ + pthread_cond_wait(&(lck)->cond, &(lck)->mutex); \ + } \ + pthread_mutex_unlock(&(lck)->mutex); \ + } while(0) + +#define DEBUG_WAKEUP_THREAD(lck) \ + do { \ + pthread_mutex_lock(&(lck)->mutex); \ + (lck)->active = false; \ + pthread_cond_broadcast(&(lck)->cond); \ + pthread_mutex_unlock(&(lck)->mutex); \ + } while(0) + +/* define a structure for collecting returned + * info from a query */ +typedef struct { + mylock_t lock; + pmix_info_t *info; + size_t ninfo; +} myquery_data_t; + +/* this is a callback function for the PMIx_Query + * API. The query will callback with a status indicating + * if the request could be fully satisfied, partially + * satisfied, or completely failed. The info parameter + * contains an array of the returned data, with the + * info->key field being the key that was provided in + * the query call. Thus, you can correlate the returned + * data in the info->value field to the requested key. + * + * Once we have dealt with the returned data, we must + * call the release_fn so that the PMIx library can + * cleanup */ +static void cbfunc(pmix_status_t status, + pmix_info_t *info, size_t ninfo, + void *cbdata, + pmix_release_cbfunc_t release_fn, + void *release_cbdata) +{ + myquery_data_t *mq = (myquery_data_t*)cbdata; + size_t n; + + /* save the returned info - the PMIx library "owns" it + * and will release it and perform other cleanup actions + * when release_fn is called */ + if (0 < ninfo) { + PMIX_INFO_CREATE(mq->info, ninfo); + mq->ninfo = ninfo; + for (n=0; n < ninfo; n++) { + fprintf(stderr, "Transferring %s\n", info[n].key); + PMIX_INFO_XFER(&mq->info[n], &info[n]); + } + } + + /* let the library release the data and cleanup from + * the operation */ + if (NULL != release_fn) { + release_fn(release_cbdata); + } + + /* release the block */ + DEBUG_WAKEUP_THREAD(&mq->lock); +} + +int main(int argc, char **argv) +{ + pmix_status_t rc; + pmix_query_t *query; + myquery_data_t myquery_data; + size_t n, k; + pmix_info_t *info; + + /* initialize - assume we are a client proc for now */ + rc = PMIx_Init(&myproc, NULL, 0); + if (PMIX_SUCCESS != rc) { + fprintf(stderr, "Init failed\n"); + exit(1); + } + + /* setup the caddy to retrieve the data */ + DEBUG_CONSTRUCT_LOCK(&myquery_data.lock); + myquery_data.info = NULL; + myquery_data.ninfo = 0; + + /* create the query itself */ + PMIX_QUERY_CREATE(query, 1); + + /* setup the query for current power usage */ + PMIX_ARGV_APPEND(&query[0].keys, PMIX_QUERY_POWER_USAGE); + /* request the available governors */ + PMIX_ARGV_APPEND(&query[0].keys, PMIX_QUERY_AVAIL_GOVERNORS); + + /* specify the node and cpus */ + PMIX_INFO_CREATE(query[0].qualifiers, 2); + PMIX_INFO_LOAD(&query[0].qualifiers[0], PMIX_HOSTNAME, "nodeA", PMIX_STRING); + PMIX_INFO_LOAD(&query[0].qualifiers[1], PMIX_CPU_LIST, "0,24,12,25", PMIX_STRING); + + /* execute the query */ + rc = PMIx_Query_info_nb(query, 1, qcbfunc, (void*)&myquery_data); + if (PMIX_SUCCESS != rc) { + fprintf(stderr, "PMIx_Query_info failed: %d\n", rc); + goto done; + } + DEBUG_WAIT_THREAD(&myquery_data.lock); + DEBUG_DESTRUCT_LOCK(&myquery_data.lock); + + /* check to see if something went wrong */ + if (NULL == myquery_data.info || 0 == myquery_data.ninfo) { + fprintf(stderr, "PMIx_Query returned no data\n"); + PMIX_QUERY_FREE(&query, 1); + goto done; + } + + /* we asked for two sets of data, so there should be two answers */ + if (2 != myquery_data.ninfo) { + /* this is an error */ + fprintf(stderr, "PMIx Query returned an incorrect number of results: %lu\n", myquery_data.ninfo); + PMIX_INFO_FREE(myquery_data.info, myquery_data.ninfo); + PMIX_QUERY_FREE(&query, 1); + goto done; + } + + /* Note that the PMIx reference server always returns the query results + * in the same order as the query keys. However, this is not guaranteed, + * so we should search the returned info structures to find the desired key + */ + for (n=0; n < myquery_data.ninfo; n++) { + + /* since we asked for usage and governors for multiple cpus, + * we expect the query to have returned a pmix_data_array_t + * containing the info for each cpu */ + if (PMIX_DATA_ARRAY != myquery_data.info[n].value.type || + NULL == myquery_data.info[n].value.darray || + PMIX_INFO != myquery_data.info[n].value.darray->type) { + fprintf(stderr, "PMIx Query returned an incorrect value type: %s\n", PMIx_Data_type(myquery_data.info[n].value.type)); + PMIX_INFO_FREE(myquery_data.info, myquery_data.ninfo); + PMIX_QUERY_FREE(&query, 1); + break; + } + + /* we asked for four cpus, so we must have four answers */ + if (4 != myquery_data.info[n].value.darray->size) { + fprintf(stderr, "PMIx Query returned an incorrect number of values: %lu\n", myquery_data.info[n].value.darray->size)); + PMIX_INFO_FREE(myquery_data.info, myquery_data.ninfo); + PMIX_QUERY_FREE(&query, 1); + break; + } + + info = (pmix_info_t*)myquery_data.info[n].value.darray; + + if (0 == strcmp(myquery_data.info[n].key, PMIX_QUERY_POWER_USAGE)) { + for (k=0; k < myquery_data.info[n].value.darray->size; k++) { + /* per the standard, the value is a double */ + fprintf(stdout, "CPU: %s Power: %f\n", info[k].key, info[k].value.dval); + } + } else if (0 == strcmp(myquery_data.info[n].key, PMIX_QUERY_AVAIL_GOVERNORS)) { + for (k=0; k < myquery_data.info[n].value.darray->size; k++) { + /* per the standard, the value is a comma-delimited string */ + fprintf(stdout, "CPU: %s Avail governors: %s\n", info[k].key, info[k].value.string); + } + } else { + fprintf(stderr, "PMIx Query returned an unknown key: %s\n", myquery_data.info[n].key); + PMIX_INFO_FREE(myquery_data.info, myquery_data.ninfo); + PMIX_QUERY_FREE(&query, 1); + break; + } + } + PMIX_INFO_FREE(myquery_data.info, myquery_data.ninfo); + PMIX_QUERY_FREE(&query, 1); + + done: + PMIx_Finalize(NULL, 0); + return 0; +} + +``` +Note that obtaining the info for multiple nodes can be accomplished by providing a comma-delimited list of hostnames to the PMIX\_HOSTNAME key. Providing a NULL value will act as a wildcard to retrieve the info for all nodes in the allocation. + +## Protoype Implementation +TBD + +## Author(s) +Ralph H. Castain +Intel, Inc. +Github: rhc54