New Runner Group #746

Open
wants to merge 27 commits into
base: master
6 changes: 4 additions & 2 deletions .github/workflows/bench.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,9 +25,10 @@ jobs:
strategy:
matrix:
device: ['cpu', 'gpu']
lbl: ['gt']
runs-on:
group: phoenix
labels: gt
labels: ${{ matrix.lbl }}
timeout-minutes: 1400
env:
ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
Expand All @@ -46,6 +47,7 @@ jobs:
path: master

- name: Bench (Master v. PR)
if: matrix.lbl == 'gt'
run: |
(cd pr && bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/bench.sh ${{ matrix.device }}) &
(cd master && bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/bench.sh ${{ matrix.device }}) &
Expand All @@ -60,7 +62,7 @@ jobs:
uses: actions/upload-artifact@v4
if: always()
with:
name: logs-${{ matrix.device }}
name: logs-${{ matrix.device }}-${{matrix.lbl}}
path: |
pr/bench-${{ matrix.device }}.*
pr/build/benchmarks/*
Expand Down
63 changes: 63 additions & 0 deletions .github/workflows/delta/submit.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
#!/bin/bash

set -e

usage() {
echo "Usage: $0 [script.sh] [cpu|gpu]"
}

if [ ! -z "$1" ]; then
sbatch_script_contents=`cat $1`
else
usage
exit 1
fi

sbatch_cpu_opts="\
#SBATCH -p cpu
#SBATCH --account=bdiy-delta-cpu
"

sbatch_gpu_opts="\
#SBATCH -p gpuA100x4,gpuA100x4-interactive
#SBATCH --account=bdiy-delta-gpu
#SBATCH --gpus-per-node=2
"

if [ "$2" == "cpu" ]; then
sbatch_device_opts="$sbatch_cpu_opts"
elif [ "$2" == "gpu" ]; then
sbatch_device_opts="$sbatch_gpu_opts"
else
usage
exit 1
fi

job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2"

sbatch <<EOT
#!/bin/bash
#SBATCH -Jshb-$job_slug # Job name
#SBATCH -N1 # Number of nodes required
$sbatch_device_opts
#SBATCH -t 01:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -n 20
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.
#SBATCH --constraint="scratch"

set -e
set -x

cd "\$SLURM_SUBMIT_DIR"
echo "Running in $(pwd):"

job_slug="$job_slug"
job_device="$2"

. ./mfc.sh load -c d -m $2

$sbatch_script_contents

EOT
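
The `job_slug` derivation in `submit.sh` (basename, strip a trailing `.sh`, replace every non-alphanumeric character with `-`, append the device) can be sketched in Python to show which job and output names result. This is a minimal illustration only; the function name `job_slug` is a hypothetical helper and not part of the toolchain.

```python
import os
import re


def job_slug(script_path: str, device: str) -> str:
    """Hypothetical mirror of: basename "$1" | sed 's/\\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g', plus "-$2"."""
    name = os.path.basename(script_path)        # drop the directory part
    name = re.sub(r"\.sh$", "", name)           # strip a trailing .sh
    name = re.sub(r"[^a-zA-Z0-9]", "-", name)   # replace every other character with '-'
    return f"{name}-{device}"


print(job_slug(".github/workflows/delta/test.sh", "gpu"))  # prints "test-gpu"
```

With this slug the heredoc would name the Slurm job `shb-test-gpu` and write its output to `test-gpu.out`.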

21 changes: 21 additions & 0 deletions .github/workflows/delta/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#!/bin/bash

build_opts=""
if [ "$job_device" == "gpu" ]; then
build_opts="--gpu"
fi

./mfc.sh test --dry-run -j 20 $build_opts

n_test_threads=8

if [ "$job_device" == "gpu" ]; then
gpu_count=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
device_opts="-g $gpu_ids"
n_test_threads=`expr $gpu_count \* 2`
fi

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/nvhpc-24.1/openmpi-4.1.5-zkiklxi/lib/
./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts -- -c delta
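
The GPU branch of `test.sh` builds a space-separated list of device IDs and runs two test threads per GPU. A rough Python sketch of that arithmetic follows; `gpu_test_layout` is a hypothetical name, and the GPU count is passed in directly instead of being read from `nvidia-smi -L`.

```python
def gpu_test_layout(gpu_count: int, threads_per_gpu: int = 2):
    """Hypothetical helper: return (gpu_ids, n_test_threads) as computed in test.sh."""
    # Equivalent of: seq -s ' ' 0 $((gpu_count - 1))
    gpu_ids = " ".join(str(i) for i in range(gpu_count))
    # Equivalent of: expr $gpu_count \* 2
    n_test_threads = gpu_count * threads_per_gpu
    return gpu_ids, n_test_threads


print(gpu_test_layout(4))  # prints ('0 1 2 3', 8)
```

Note that the IDs are separated by spaces, not commas, which is the form passed after `-g`.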

3 changes: 3 additions & 0 deletions .github/workflows/formatting.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ jobs:
steps:
- uses: actions/checkout@v4

- name: MFC Python setup
run: ./mfc.sh init

- name: Check formatting
run: |
./mfc.sh format -j $(nproc)
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/frontier/build.sh
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
#!/bin/bash

. ./mfc.sh load -c f -m g
./mfc.sh build -j 8 --gpu
./mfc.sh test --dry-run -j 8 --gpu
1 change: 0 additions & 1 deletion .github/workflows/frontier/test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,3 @@ gpus=`rocm-smi --showid | awk '{print $1}' | grep -Eo '[0-9]+' | uniq | tr '\n'
ngpus=`echo "$gpus" | tr -d '[:space:]' | wc -c`

./mfc.sh test --max-attempts 3 -j $ngpus -- -c frontier

15 changes: 15 additions & 0 deletions .github/workflows/ice/bench.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#!/bin/bash

n_ranks=12

if [ "$job_device" == "gpu" ]; then
n_ranks=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1))) # 0,1,2,...,gpu_count-1
device_opts="--gpu -g $gpu_ids"
fi

if [ "$job_device" == "gpu" ]; then
./mfc.sh bench --mem 12 -j $(nproc) -o "$job_slug.yaml" -- -c phoenix $device_opts -n $n_ranks
else
./mfc.sh bench --mem 1 -j $(nproc) -o "$job_slug.yaml" -- -c phoenix $device_opts -n $n_ranks
fi
61 changes: 61 additions & 0 deletions .github/workflows/ice/submit.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
#!/bin/bash

set -e

usage() {
echo "Usage: $0 [script.sh] [cpu|gpu]"
}

if [ ! -z "$1" ]; then
sbatch_script_contents=`cat $1`
else
usage
exit 1
fi

sbatch_cpu_opts="\
#SBATCH --ntasks-per-node=20 # Number of cores per node required
"

sbatch_gpu_opts="\
#SBATCH --ntasks-per-node=20 # Number of cores per node required
#SBATCH -G H100:2\
"

if [ "$2" == "cpu" ]; then
sbatch_device_opts="$sbatch_cpu_opts"
elif [ "$2" == "gpu" ]; then
sbatch_device_opts="$sbatch_gpu_opts"
else
usage
exit 1
fi

job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2"

sbatch <<EOT
#!/bin/bash
#SBATCH -Jshb-$job_slug # Job name
#SBATCH -N1 # Number of nodes required
#SBATCH -n 20 # Number of tasks required
$sbatch_device_opts
#SBATCH -t 03:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.
#SBATCH --exclude=atl1-1-02-009-33-0

set -e
set -x

cd "\$SLURM_SUBMIT_DIR"
echo "Running in $(pwd):"

job_slug="$job_slug"
job_device="$2"

. ./mfc.sh load -c p -m $2

$sbatch_script_contents

EOT

19 changes: 19 additions & 0 deletions .github/workflows/ice/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
#!/bin/bash

build_opts=""
if [ "$job_device" == "gpu" ]; then
build_opts="--gpu"
fi

./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8

if [ "$job_device" == "gpu" ]; then
gpu_count=$(nvidia-smi -L | wc -l) # number of GPUs on node
gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
device_opts="-g $gpu_ids"
n_test_threads=`expr $gpu_count \* 2`
fi

./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
1 change: 1 addition & 0 deletions .github/workflows/line-count.yml
Original file line number Diff line number Diff line change
Expand Up @@ -49,5 +49,6 @@ jobs:
cd $BASE
export MFC_PR=$PR
pwd
./mfc.sh init &> tmp.txt
./mfc.sh count_diff

3 changes: 3 additions & 0 deletions .github/workflows/lint-toolchain.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,8 @@ jobs:
steps:
- uses: actions/checkout@v4

- name: MFC Python setup
run: ./mfc.sh init

- name: Lint the toolchain
run: ./mfc.sh lint
2 changes: 1 addition & 1 deletion .github/workflows/phoenix/test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ if [ "$job_device" == "gpu" ]; then
build_opts="--gpu"
fi

./mfc.sh build -j 8 $build_opts
./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8

Expand Down
5 changes: 4 additions & 1 deletion .github/workflows/spelling.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name: Spell Check

on: [push, workflow_dispatch]
on: [push, pull_request, workflow_dispatch]

jobs:
run:
Expand All @@ -10,5 +10,8 @@ jobs:
- name: Checkout
uses: actions/checkout@v4

- name: MFC Python setup
run: ./mfc.sh init

- name: Spell Check
run: ./mfc.sh spelling
10 changes: 9 additions & 1 deletion .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ jobs:
strategy:
matrix:
device: ['cpu', 'gpu']
lbl: ['gt', 'frontier']
lbl: ['gt', 'delta', 'frontier']
exclude:
- device: cpu
lbl: frontier
Expand All @@ -121,6 +121,14 @@ jobs:
if: matrix.lbl == 'gt'
run: bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/test.sh ${{ matrix.device }}

# - name: Build & Test
# if: matrix.lbl == 'ice'
# run: bash .github/workflows/ice/submit.sh .github/workflows/ice/test.sh ${{ matrix.device }}

- name: Build & Test
if: matrix.lbl == 'delta'
run: bash .github/workflows/delta/submit.sh .github/workflows/delta/test.sh ${{ matrix.device }}

- name: Build
if: matrix.lbl == 'frontier'
run: bash .github/workflows/frontier/build.sh
Expand Down
7 changes: 2 additions & 5 deletions src/simulation/p_main.fpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,16 +14,13 @@
!! are only available in the volume fraction model.
program p_main

! Dependencies =============================================================

use m_global_parameters !< Definitions of the global parameters
use m_global_parameters

use m_start_up

use m_time_steppers

use m_nvtx
! ==========================================================================

implicit none

Expand Down Expand Up @@ -71,7 +68,7 @@ program p_main
finaltime = t_step_stop*dt
end if

call nvtxEndRange ! INIT
call nvtxEndRange

call nvtxStartRange("SIMULATION-TIME-MARCH")
! Time-stepping Loop =======================================================
Expand Down
1 change: 1 addition & 0 deletions toolchain/mfc/args.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@ def add_common_arguments(p, mask = None):
test.add_argument("-m", "--max-attempts", type=int, default=1, help="Maximum number of attempts to run a test.")
test.add_argument( "--no-build", action="store_true", default=False, help="(Testing) Do not rebuild MFC.")
test.add_argument("--case-optimization", action="store_true", default=False, help="(GPU Optimization) Compile MFC targets with some case parameters hard-coded.")
test.add_argument( "--dry-run", action="store_true", default=False, help="Build and generate case files but do not run tests.")
test_meg = test.add_mutually_exclusive_group()
test_meg.add_argument("--generate", action="store_true", default=False, help="(Test Generation) Generate golden files.")
test_meg.add_argument("--add-new-variables", action="store_true", default=False, help="(Test Generation) If new variables are found in D/ when running tests, add them to the golden files.")
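
To illustrate how a `store_true` flag such as `--dry-run` is typically consumed, here is a small standalone `argparse` sketch. The parser, the `plan` function, and the step names are assumptions made for illustration; the diff does not show how the MFC toolchain acts on the flag, so the behavior below only follows the help text.

```python
import argparse

# Hypothetical standalone parser; the real one is built in toolchain/mfc/args.py.
parser = argparse.ArgumentParser(prog="test-sketch")
parser.add_argument("-j", "--jobs", type=int, default=1)
parser.add_argument("--dry-run", action="store_true", default=False,
                    help="Build and generate case files but do not run tests.")


def plan(argv):
    """Return the steps a run would perform, assuming the semantics of the help text."""
    args = parser.parse_args(argv)
    steps = ["build", "generate case files"]
    if not args.dry_run:  # argparse exposes --dry-run as args.dry_run
        steps.append("run tests")
    return steps


print(plan(["--dry-run", "-j", "8"]))  # prints ['build', 'generate case files']
```

This matches how the workflow scripts in this PR use the flag: `./mfc.sh test --dry-run` takes the place of a separate `./mfc.sh build` step before the real test invocation.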
Expand Down
3 changes: 3 additions & 0 deletions toolchain/mfc/run/run.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@

from . import queues, input

import hunter


def __validate_job_options() -> None:
if not ARG("mpi") and any({ARG("nodes") > 1, ARG("tasks_per_node") > 1}):
Expand Down Expand Up @@ -133,6 +135,7 @@ def __execute_job_script(qsystem: queues.QueueSystem):
raise MFCException(f"Submitting batch file for {qsystem.name} failed. It can be found here: {__job_script_filepath()}. Please check the file for errors.")


@hunter.wrap(local=True)
def run(targets = None, case = None):
targets = get_targets(list(REQUIRED_TARGETS) + (targets or ARG("targets")))
case = case or input.load(ARG("input"), ARG("--"))
Expand Down
5 changes: 4 additions & 1 deletion toolchain/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,10 @@ dependencies = [

# Chemistry
"cantera",
"pyrometheus==1.0.2"
"pyrometheus==1.0.2",

# Logging
"hunter"
]

[tool.hatch.metadata]
Expand Down