wava
is a linux command line tool wich allows for multiple users to securely run batch processes, scheduded in a timely manner under contraints of allocated (resident) memory capacities.
Table of contents
- Top
- Overview
- Features
- Architecture
- Priority and groups
- Events
- Job hierarchy
- Requirements
- Installation
- Configuration
- Running
- Support bugs and requests
- Authors
- License
wava
scheduler is designed to run batch processes in a single Linux machine in an operator-friendly manner while imposing a memory limit across the overall job set (strict limits) and guaranteeing a minimum of available memory per job (soft limits).
Traditionally fixed-sized queues were used for enqueueing jobs, but when used over a heterogeneus (in terms of memory demands) set of jobs, they show different weaknesses: inefficient resource utilization, system performance degradation, and job resource competition.
wava
scheduler is designed to overcome this, that is, without losing easy of use, providing guarantees for both system stability and job resource allocation and, on the other side, offering better resource utilization rates.
Jobs are submitted with a minimum memory claim (job size), then enqueued, and finally executed in a sandboxed environment (implemented via cgroups) with the scheduler capacity as hard limit and the jobs size as soft limit.
At runtime, the job process tree is allowed to allocate an amount of resident memory up to the scheduler-capacity (if no more jobs are scheduled), and in case of memory pressure forced to swap out.
The scheduler guareantees that each job process tree has at its disposition at least the amount of minimum memory claimed at job submission.
Job processes are run by the same machine user that submitted the job, so the scheduler can not be used to escale the running privileges of a user.
Scheduling is based on memory. The scheduler is configured to have a certain capacity and each job has an associated minimum memory size.
The main scheduling constraint is the following: the sum of the running jobs minimum memory size cannot exceed the scheduler capacity.
This feature allows jobs to be submitted and scheduled with different priorities.
All jobs belong (implicity or explicity) to a priority group that determines their global ordering, used for positioning in the queue and assigning a process niceness when running.
Both global and per job statistics can be recorded.
Statistics sensitivity can be tuned with the statsCpuStep
,statsRssStep
,statsSwapStep
and statsIOStep
configuration parameters.
#start end running queued cpu(%) rss(B) swap(B) io(B/s)
1494930723326 1494930723326 0 0 0.0 0 0 0
1494930723326 1494930731348 1 0 0.0 0 0 0
1494930731348 1494930733355 1 0 92.7 696320 0 0
...
To enable global statistics set the "logStat"=true"
in the configuration file.
At submit time (wava -r
) specify an additional parameter -s
with the folder to store the stats files for this only execution.
The scheduler runs as a centralized process (wava -s
) and all interaction with it is performed by separated client processes started when invoking the other command options.
In particular job submissions are performed by separate peer processes (wava -r
) that serve as lightweight placeholders of the real jobs executed by the scheduler. Peer and job processes have their lifecycle bound to each other. If one dies the other dies too.
The scheduler pipes standard io-streams between the job processes and their respective peer processes. Additionally, it pipes scheduler events to the peer stderr
unless an event file has been specified in submission (wava -r -e <file>
).
Running an example command requiring a minimum of 100MB. Observe also that this command runs untils user 'nacho' cancels it, returning a non-zero return code
All submitted jobs belong to a priority group that determines their global ordering, used for positioning in the queue and assigning a process niceness when running.
Besides priority
groups also have a timeToIdleSeconds
property. This is the time elapsed between the last job finishes and the group is removed. If this value is set to -1
, the group is eternal.
Jobs that do not specify a group at submit time are assigned to the default
group (priority=0
, timeToIdleSeconds=-1
).
Jobs that specify a non-existing group create at submit-time a dynamic group (priority=0
, timeToIdleSeconds
specified at configuration).
Sample output of command wava -g -l
for querying groups
Jobs are ordered by the following rules:
- First by group priority (lower value means higher priority)
- Then by group id (incremental). In case of same priority, jobs of the oldest group go first.
- Finally, by job id (incremental). For jobs inside the same group, FIFO ordering.
Sample output of command wava -j
for querying jobs (white: running, yellow: queued). This scheduler instance has a capacity of 500 MB
The scheduler sets the niceness of the job processes according to their global ordering within the working niceness range. The concrete strategy is determined by the NicenessHandler
implementation used (set in configuration).
Besides stderr
and stdout
, the scheduler process maintains a dedicated channel (named pipe) for communicating events to client processes. These events are serialized in the form:
${time-millis}:${event-type}[:${event-value}]
For peer processes these events are output to stderr
after being formatted as [wava] [date] [${event-type}:${event-value}]
unless a file is specified (wava -r -e <file>
) for redirecting them.
Event type (Events.java ) |
Valued | Description |
---|---|---|
id |
yes | Id assigned to the job. |
queued |
yes | Position in the queue, if the job is queued. |
priority |
yes | Piority of the job, given by its group. |
running |
yes | Root pId of the job process when started. |
niceness |
yes | Niceness set to the job process. |
cancelled |
yes | User cancelling the job. |
ping |
no | Send periodically to detect stale peers. |
exceed_tree |
yes | Memory claim exceeds capacity |
shutdown |
yes | Scheduler is being stopped |
maxrss |
yes | Max RSS allocated to the process tree |
maxswap |
yes | Max swap allocated to the process tree |
error |
yes | To send information about an error. |
retcode |
yes | Return code for the client process to use. |
starvation_relaunch |
yes | Indicates that the job has been reenqueued due to a starvation scenario (applies for idempotent jobs) |
starvation_stop |
yes | Indicates that the job has been stopped due to a starvation scenario (applies for non-idempotent jobs) |
Running jobs can submit more jobs, thus a job hierarchy is established. This potentially can lead to a deadlock scenario, when all parent (running) jobs are waiting for a queued child job to finish.
All parent jobs with no children running are considered blocked, ie waiting for their queued children to finish.
In order to avoid deadlock, and prevent from starvation (having too much jobs blocked by a waiting children), the scheduler follows a series of rules that may force a running blocking job to be re-enqueued (if submitted as 'idempotent') or even stoped.
On each scheduler main-loop iteration:
- The scheduler choses a candidate job to be preempted, based on its idempotency (idempotent first) and priority (low priority first).
- In case that the ratio of the sum of the sizes of the blocked jobs to the scheduler capacity exceeds a configurable value, the scenario is considered as starving, and the scheduler preempts the candidate job to make room for a potentially blocking job to run.
$JAVA_HOME
environment variable set pointing to a JRE 8+
export WAVA_HOME=/opt/wava
make this variable persistent adding the previous line to the file: ~root/.bashrc
Multiple scheduler instances are supported in a single machine. Just change the current
WAVA_HOME
value to point to one or another
sudo -E bash -c "$(curl -L https://raw.githubusercontent.com/brutusin/wava/master/wava-core/src/main/scripts/wava-update)"
- Create service file
/etc/systemd/system/wava.service
with the following contents:
[Unit]
#@author Ignacio del Valle Alles
Description=WAVA scheduler
[Service]
Type=forking
ExecStart=/opt/wava/bin/wava -s &
ExecStop=/opt/wava/bin/wava -x
# set delegate yes so that systemd does not reset the cgroups
Delegate=yes
[Install]
WantedBy=multi-user.target
- Reload
systemd
configuration:
sudo systemctl daemon-reload
- Enable for running at startup:
sudo systemctl enable wava.service
- Init service:
sudo systemctl start wava.service
In version 219 it is neccessary to set JoinControllers=
at /etc/systemd/system.conf
to have different controllers for cpuacct
and cpu
cgroup subsystems, otherwise the cgroup of the jobs cannot be changed at runtime. Maybe this configuration change requires to rebuild the initial ramdisk image to take effect.
Configuration is set in file: $WAVA_HOME/cfg/wava.json
. Environment variables can be used in this file.
"tempFolder" : "/dev/shm",
"uICfg" : {
"ansiColors" : true,
"sIMemoryUnits" : true
},
"schedulerCfg" : {
"nicenessHandlerClassName" : "org.brutusin.wava.core.plug.impl.niceness.HomogeneusSpreadNicenessHandler",
"cgroupRootPath" : "/sys/fs/cgroup",
"refreshLoopSleepMillisecs" : 1000,
"pingMillisecs" : 1000,
"schedulerCapacity" : "$DEFAULT_CAPACITY",
"maxSwap" : "$DEFAULT_SWAP",
"maxJobSize" : "$DEFAULT_CAPACITY",
"outOfMemoryKillerEnabled" : false,
"maxBlockedRssStarvationRatio" : 0.5,
"logFolder" : "/tmp/wava",
"loggingLevel" : "FINE",
"maxLogSize" : "100MB",
"maxStatsLogSize" : "100MB",
"statsCpuStep" : 15,
"statsRssStep" : "50MB",
"statsSwapStep" : "50MB",
"statsIOStep" : "50MB",
"logStats" : true
},
"processCfg" : {
"nicenessRange" : [ 1, 19 ],
"cpuAfinity" : "$DEFAULT_CPU_AFINITY"
},
"groupCfg" : {
"dynamicGroupIdleSeconds" : 10,
"predefinedGroups" : [ {
"name" : "high",
"priority" : -10,
"timeToIdleSeconds" : -1
}, {
"name" : "low",
"priority" : 10,
"timeToIdleSeconds" : -1
} ]
}
}
Property | Description |
---|---|
uICfg.ansiColors |
Use ANSI escape code sequences to highlight UI. |
uICfg.sIMemoryUnits |
Use units from the International System for output memory values. true : kB based, false :KiB based |
schedulerCfg.nicenessHandlerClassName |
FQN of the NicenessHandler implementation (see impl package) to use. |
schedulerCfg.memoryCgroupBasePath |
Root path to the parent memory cgroup |
schedulerCfg.refreshLoopSleepMillisecs |
Sleeping time for the main looping thread. |
schedulerCfg.pingMillisecs |
Time interval between ping events to peer processes. |
schedulerCfg.schedulerCapacity |
Scheduler capacity. Maximum amount of physical memory permitted for all jobs. By default is 3/4 of total memory. Different memory units can be used, for example 4 GB |
schedulerCfg.maxSwap |
Maximum swap size to be used by all wava jobs. By default equals to the total amount of swap available in the system |
schedulerCfg.maxJobSize |
Maximum value for a job memory claim. By default equal to the scheduler capacity |
schedulerCfg.outOfMemoryKillerEnabled |
Enable/disable the Out Of Memory Killer, triggered when a job is forced to page out and there is no enough swap memory available. If disabled the job is stopped until enough memory is available. |
schedulerCfg.maxBlockedRssStarvationRatio |
Maximum ratio between the sum of memory claims of the blocked jobs divided by the scheduler capacity. If exceeded the starvation prevention mechanism is triggered. |
schedulerCfg.logFolder |
Folder to store logs and global stats (if enabled). |
schedulerCfg.loggingLevel |
Logging level (According to the Java logging levels) |
schedulerCfg.maxLogSize |
Maximum size allowed overall logging files |
schedulerCfg.maxStatsLogSize |
Maximum size allowed overall stats files for global stats, and per job stats if enabled |
schedulerCfg.statsCpuStep |
Stats cpu percentage precission |
schedulerCfg.statsRssStep |
Stats rss memory precission |
schedulerCfg.statsSwapStep |
Stats swap memory precission |
schedulerCfg.statsIOStep |
Stats io bandwidth precission (per second) |
schedulerCfg.logStats |
true to enable global stats logging |
processCfg.nicenessRange |
Minimum (most favorable) and maximum (less favorable) niceness to be assigned to a job process tree |
processCfg.cpuAfinity |
CPU affinity to be set to the job processes. In a format supported by the -c parameter of taskset. |
groupCfg.dynamicGroupIdleSeconds |
Idle time for dynamic groups in seconds. |
groupCfg.predefinedGroups |
Set of groups to be available since startup. |
> wava -s&
Logging to /tmp/wava/logs ...
> wava -r -m 100MB -s /tmp/wava/my-job-stats bash -c "while true; do date; done" &> /dev/null &
> wava -j
Jobs: 1 running; 0 bloqued; 0 queued
Available memory: 23.8 GB / 24.7 GB
JOB INFO 100.0 MB PROCESS TREE STATS 696.3 kB 0 B 0 B 92.7 COMMAND
JOB ID PARENT GROUP USER JOB_RSS NICE MAX_RSS MAX_SWAP MAX_IO RSS SWAP IO CPU%
1 default root 100.0 MB 1 696.3 kB 0 B 0 B/s 696.3 kB 0 B 0 B/s 92.7 [bash, -c, while true; do date; done]
> wava -c 1
Running job sucessfully cancelled
> wava -x
Stopping scheduler process ...
Execute the wava
command for detailed help
> wava
___________________________________________
__ __ _____ __ __ _____
/ / / _ / | / _ \
| /\ | _ \ | | _ |
\__/\__\__|__/\___/\__|__/
[W]hen [AVA]ilable scheduler 2.3.1-SNAPSHOT
___________________________________________
usage: wava [option]
-a,--about information about the program
-c,--cancel cancel a running or enqueued job
-g,--group group management commands
-h,--help print this message
-j,--jobs view jobs
-r,--run enqueue a job to be executed when enough physical memory is available
-s,--start start core scheduler process
-t,--status return core process status
-u,--update update to lastest version
-v,--version show wava version
-x,--exit stop core process, terminating all jobs
https://github.com/brutusin/wava/issues
- Ignacio del Valle Alles (https://github.com/idelvall/)
Contributions are always welcome and greatly appreciated!
Apache License, Version 2.0