Running the worker in the Amazon AWS EC2 cloud

vdbergh edited this page Jan 8, 2024 · 67 revisions

This page is up to date as of January 2024.

It's possible to run the fishtest worker using compute instances on Amazon Web Services (AWS) Elastic Compute Cloud (EC2). Rather than using "on-demand" instances, the more cost-effective option is to use a Spot instance, which gives you spare CPU capacity at a reduced cost. A Spot instance might be stopped by AWS, but this is no problem for fishtest. AWS bills your credit card at a rate that you can monitor and control (typically $0.50-0.70 per hour for the high-grade instance needed for fishtest).

Get your personal account and password for fishtest

http://tests.stockfishchess.org/signup

Make a note of your username and password; you will need them later.

Get your personal account at Amazon Web Services (AWS)

http://aws.amazon.com/

Creating your Amazon Web Services (AWS) account is free, but requires a credit card. Register for AWS using a name and password that are different from the fishtest ones! Keep your AWS username and password secure.

Starting fishtest worker instances to contribute to Fishtest

With the above two usernames and passwords, you can launch AWS Elastic Compute Cloud (EC2) instances relatively easily. Ignore the wide variety of options offered by AWS; the defaults will work fine.

Create your instance

Pricing for Amazon EC2 operates on two main models: 'on-demand' and 'spot'. On-demand instances, like c4.8xlarge, are costly (over per hour) but offer immediate, uninterrupted service. These are ideal for time-sensitive tasks. However, if Amazon has available c4.8xlarge capacity and no on-demand requests, they offer this excess capacity to 'spot' users at a lower rate. Spot instances are less reliable—they're stopped when an on-demand user needs them—but they're cost-effective for tasks that don't require continuous running, like our fishtest worker, where intermittent operation still yields output the fishtest framework can use. Below, we are creating a spot instance that will run the fishtest worker only when the price of a spot instance dips below the amount per hour that we set.

  1. Open https://console.aws.amazon.com/ec2/ and log in to AWS. You will then be at your EC2 dashboard.
  2. Click on the orange button "Launch Instance".
  3. Go through the various instance configuration steps:
    1. Under "Name and tags" give the instance a name; any name will do.
    2. Under "Application and OS Images (Amazon Machine Image)" click on the third one, "Ubuntu" and leave all other settings in that area at their defaults.
    3. Under "Instance type" you must select the type of instance. Typically the default is t2.micro, but you should NOT use that. Click on the text describing the t2.micro instance and a drop-down style menu will open and a search bar will appear. Type "c4.8x" and only one instance type will show: the c4.8xlarge instance type. Select this c4.8xlarge instance type.
    4. Under "Key pair (login)" you can select the first option, "Proceed without a key pair". This just means you need to manage (start, stop, terminate) the worker from the EC2 web screen and will not be logging in via ssh.
    5. Leave all other settings as they are. Then, and this is very important, expand the "Advanced details" option at the bottom of the page.
    6. Scroll down under "Advanced details" and find the area "Purchasing option" and check the box for "Request Spot Instances".
    7. After you check "Request Spot Instances", a blue "Customize" link will appear next to it. Click that link and the options will open below.
    8. Under "Maximum price" you should indicate the maximum price you are willing to pay, per hour, to have Amazon EC2 launch a worker with the settings you are specifying. It is simplest to choose "No maximum price". You will be paying the spot price which is typically less than 40% of the on-demand price. You will never pay more than the on-demand price.
    9. Under "Request type" select "Persistent" so that when your spot instance is interrupted (because the spot price rises above your maximum price or because AWS needs the capacity) it will be re-launched when it can be run again.
    10. Under "Interruption behavior" select "Stop".
    11. Scroll down to the text box under "User data - optional" and copy-paste the script below. Remember to replace USERNAME and PASSWORD (at the start of the script) with your fishtest username and password. Note that spaces, line breaks, etc., matter. The script updates the software and installs the essential tools, including the fishtest worker. Furthermore, it disables hyper-threading, which is the preferred setup for running fishtest.
    12. Click the orange Launch Instance box all the way at the bottom. Wait 2-3 minutes and then check on the status of your instance.
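
For readers who prefer the command line, the console choices in the steps above correspond roughly to a single AWS CLI request. The following is only a sketch under stated assumptions, not the procedure this page describes: `AMI_ID`, `USER_DATA_FILE`, `spot_options` and `launch_worker` are hypothetical names, the AMI id is a placeholder for an Ubuntu image in your region, and the user data file is assumed to be a local copy of the script given in the "Script" section that follows. By default the sketch only prints the command; it launches (and bills) nothing unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# Sketch: the console choices expressed as one AWS CLI request.
# All names and ids here are placeholders chosen for illustration.

DRY_RUN="${DRY_RUN:-1}"                          # 1: only print the command
AMI_ID="${AMI_ID:-ami-EXAMPLE}"                  # an Ubuntu AMI in your region
USER_DATA_FILE="${USER_DATA_FILE:-userdata.sh}"  # local copy of the user data script

# Persistent spot request that is stopped (not terminated) on interruption.
# Leaving out MaxPrice corresponds to choosing "No maximum price".
spot_options() {
  printf '%s' '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"persistent","InstanceInterruptionBehavior":"stop"}}'
}

# Build the request and either print it (dry run) or execute it.
launch_worker() {
  set -- aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type c4.8xlarge \
    --instance-market-options "$(spot_options)" \
    --user-data "file://$USER_DATA_FILE"
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

launch_worker
```

The dry-run default is a deliberate choice: a launched c4.8xlarge is billed, so the sketch requires an explicit opt-in before it contacts AWS.
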
Script
#!/bin/bash

# ===================== parameter section ===================== 

# Replace USERNAME and PASSWORD with your fishtest username and password.
# The double quotes deal with symbols in username or password: don't delete them.

username="USERNAME"
password="PASSWORD"

# Partition the available cores into independent fishtest workers.
# The sum of the parts should be the number of physical cores - 1
# when hyper threading is disabled (see below). Otherwise it
# should be the number of virtual cores - 1.
# Order the parts in descending order to minimize the time control error
# for the first batch of games (the time control is adjusted according to
# the CPU load).

core_partition="14 3"  # suitable for the c4.8xlarge instance

# If you run multiple instances, change this into a different
# uuid_prefix for each one (e.g. ec2awrk, ec2bwrk, ...). For a single
# instance you can leave this at the default.
# Currently the uuid_prefix can be at most 8 alpha-numeric characters

uuid_prefix=ec2wrk  # note: the worker number will be appended to this

# The default is to disable hyper-threading. If you enable it, check
# that there are no time losses or crashes.

disable_ht=1 # 0 to enable hyper-threading

# ===================== start install script ===================== 

# Output for this script goes to
# /var/log/cloud-init-output.log
# Check this file for errors if things do not work as intended.

echo "Updating software..."

apt update && apt full-upgrade -y && apt autoremove -y && apt clean
apt install -y python3 build-essential unzip

echo "Disabling the attached storage..."

# The storage that AWS attaches to some instances may create a long
# running kernel process ext4lazyinit which increases the load average
# by 1, complicating our monitoring (see below). Since we do not need
# the extra storage, we attempt to simply disable it.

sed -Ei '/\/mnt/{s/^#//;s/^/#/}' /etc/fstab

echo "Creating the worker user..."

worker_user=fishtest
useradd -m -s /bin/bash ${worker_user}
# get the worker_user $HOME
worker_user_home=$(sudo -i -u ${worker_user} << 'EOF'
  echo ${HOME}
EOF
)

echo "Creating the worker directories and filling them..."

sudo -i -u ${worker_user} << EOF
wget https://github.com/official-stockfish/fishtest/archive/master.zip
partitionid=0
uuid_prefix=$uuid_prefix
for partition in ${core_partition}
do
  partitionid=\$((partitionid+1))
  mkdir worker\$partitionid
  cd worker\$partitionid
  unzip ../master.zip
  cd fishtest-master/worker
  # create config file
  python3 worker.py "$username" "$password" --concurrency \$partition -u \$uuid_prefix\$partitionid --no_validation --only_config
  cd ../../..
done
EOF

echo "Creating a script/service to disable hyper-threading..."

cat << EOF > /usr/bin/disable-ht.sh
#!/bin/bash
# disable hyper-threads 
for cpunum in \$(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do
  echo 0 > /sys/devices/system/cpu/cpu\$cpunum/online
done
EOF

chmod 755 /usr/bin/disable-ht.sh

cat << EOF > /etc/systemd/system/disable-ht.service
[Unit]
Description=Disable hyper-threading
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/bin/disable-ht.sh

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
if (( disable_ht==1 )); then
    systemctl enable disable-ht
fi

echo "Creating the monitoring script/service..."
echo "If the 5 minutes load average gets below cores/7 we stop the instance..."

core_partition="`echo $core_partition`"   # trim core_partition
cores=`echo "${core_partition// /+}" | bc`
min_load=`echo "scale=2; $cores/7" | bc`

cat << EOF > /usr/bin/fishtest-monitoring.sh
#!/bin/bash
# The process ext4lazyinit is always in the D state, so it adds 1 to
# the load average.
ext4lazyinit=\`ps -eo pid,stat,comm | grep " D" |grep ext4lazyinit | wc -l\`
min_load=\`echo "scale=2; $min_load+\$ext4lazyinit" | bc\` 
echo cores=$cores min_load=\$min_load  \`date\` \`cat /proc/loadavg\`
not_loaded=\`awk -v min_load=\$min_load '{if (\$2 > min_load) {print 0} else {print 1}}' /proc/loadavg\`
if (( \$not_loaded )); then
  echo "load average too low: powering off"
  sleep 1
  poweroff
fi
EOF

chmod 755 /usr/bin/fishtest-monitoring.sh

cat << EOF > /etc/systemd/system/fishtest-monitoring.service
[Unit]
Description=Fishtest monitoring
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/fishtest-monitoring.sh

[Install]
WantedBy=multi-user.target
EOF

cat << EOF > /etc/systemd/system/fishtest-monitoring.timer
[Unit]
Description=Fishtest monitoring timer
After=multi-user.target

[Timer]
OnActiveSec=300
OnUnitActiveSec=60

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable fishtest-monitoring.timer

# Install the fishtest-workers as systemd service.
# Start/stop a worker with:
# sudo systemctl start fishtest-worker@[nr]
# sudo systemctl stop fishtest-worker@[nr]
# Check the status of a worker with:
# sudo systemctl status fishtest-worker@[nr]
# Check the log with:
# sudo journalctl -u fishtest-worker@[nr]
# The service uses the worker configuration file
# ${worker_user_home}/worker[nr]/fishtest-master/worker/fishtest.cfg.
# Output can be found in
# ${worker_user_home}/worker[nr]/fishtest-master/worker/worker.log.

cat << EOF > /etc/systemd/system/fishtest-worker@.service
[Unit]
Description=Fishtest worker %i
After=multi-user.target

[Service]
Type=simple
StandardOutput=file:${worker_user_home}/worker%i/fishtest-master/worker/worker.log
StandardError=inherit
ExecStart=python3 -u ${worker_user_home}/worker%i/fishtest-master/worker/worker.py
User=${worker_user}
WorkingDirectory=${worker_user_home}/worker%i/fishtest-master/worker

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

echo "Enabling the worker service..."

partitionid=0
for partition in ${core_partition}
do
  partitionid=$((partitionid+1))
  systemctl enable fishtest-worker@$partitionid
done

echo "Worker installed! Rebooting..."

reboot
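
The partition rule from the parameter section of the script (the parts should add up to the number of cores minus one) can be checked with a few lines of shell before you paste the script. This is a minimal sketch and not part of fishtest: `partition_sum` and `check_partition` are hypothetical helper names, and the core count of 18 assumes a c4.8xlarge (36 virtual cores) with hyper-threading disabled.

```shell
#!/bin/bash
# Sketch: check that a core partition adds up to (cores - 1).
# The helper names are hypothetical and are not used by the install script.

# Sum the space-separated parts of a partition string, e.g. "14 3" -> 17.
partition_sum() {
  local total=0 part
  for part in $1; do
    total=$((total + part))
  done
  echo "$total"
}

# Succeed when the partition uses all cores but one.
# $1: partition string, $2: number of usable cores.
check_partition() {
  local used
  used=$(partition_sum "$1")
  if [ "$used" -eq $(($2 - 1)) ]; then
    echo "ok: $used of $2 cores used"
  else
    echo "mismatch: partition uses $used cores, expected $(($2 - 1))"
    return 1
  fi
}

# Assumed core count: 18 physical cores with hyper-threading disabled.
check_partition "14 3" 18
```

If you enable hyper-threading (`disable_ht=0`), the second argument would be the number of virtual cores instead, as the comments in the script explain.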

Manage your instance

  1. See http://tests.stockfishchess.org/tests . Within a few minutes of your spot instance starting, your username should appear in the list of active machines with two workers, one with 14 cores and another with 3 (depending on the partition chosen in the script, and provided there are pending or running tests). The flag shown depends on the region you selected in AWS EC2. If your instance does not appear after a few minutes, consider terminating it.
  2. The instance will be stopped if no tests are available, or if fishtest is down. In that case you will have to restart it manually.
  3. To manually stop a running instance (and stop billing your credit card), go to the AWS dashboard / instances / instances. You can right-click the instance and select instance state / stop.
  4. Note that spot instances can be stopped by AWS at any time.
  5. To get an overview of your EC2 usage and costs, go to the AWS dashboard / Billing and Cost Management.
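
The automatic stop mentioned in item 2 comes from the monitoring service installed by the script, which powers the instance off when the 5-minute load average is not above cores/7. The decision can be sketched as follows. This is a simplified illustration, not the installed script: `should_stop` is a hypothetical helper name, the load values are made up, and the correction that the real script applies for the `ext4lazyinit` process is left out.

```shell
#!/bin/bash
# Sketch of the monitoring decision: stop when the 5-minute load average
# is not above cores/7. should_stop is a hypothetical helper name.

# $1: total cores given to the workers, $2: 5-minute load average.
# Prints "stop" or "keep".
should_stop() {
  awk -v cores="$1" -v load="$2" 'BEGIN {
    min_load = cores / 7
    if (load > min_load) print "keep"; else print "stop"
  }'
}

# With the default partition "14 3" the workers use 17 cores,
# so the threshold is 17/7, a little over 2.4.
should_stop 17 16.80   # a busy instance is kept running
should_stop 17 0.35    # an idle instance is stopped
```

On a running instance the second argument would be the second field of `/proc/loadavg`, which is what the installed monitoring script reads.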

Terminating Instances and Cancelling Spot Requests

It's crucial to understand that a spot request and an instance are distinct entities. When you create a spot request as described earlier, it will automatically launch an instance whenever the spot price falls below your maximum price.

Important: Terminating a spot instance does not prevent future instances from being launched. This action only terminates the current instance.

To cease future instance launches, you must cancel the spot request itself. To do this, go to "Spot Requests" in the left side menu. Select the checkbox for the spot request you intend to cancel. Then, click on "Actions" and choose "Cancel Request". This will stop any further instances from being initiated under that specific spot request.
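
The same two-step cleanup can also be done with the AWS CLI, assuming you have it installed and configured. The sketch below is only an illustration: `run` and `stop_contributing` are hypothetical helper names, and the two ids are placeholders for your own spot request id and instance id. It prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# Sketch: cancel a spot request, then terminate its instance.
# The ids used below are placeholders; run() only prints unless DRY_RUN=0.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# $1: spot request id (sir-...), $2: instance id (i-...).
stop_contributing() {
  # Cancel the request first so that no replacement instance is launched.
  run aws ec2 cancel-spot-instance-requests --spot-instance-request-ids "$1"
  # Cancelling a request does not by itself terminate a running instance.
  run aws ec2 terminate-instances --instance-ids "$2"
}

stop_contributing "sir-EXAMPLE" "i-EXAMPLE"
```

Cancelling before terminating matches the point made above: terminating alone leaves the persistent request active, so a new instance could still be launched.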