Feature/slurm remote support #250

Draft · wants to merge 30 commits into develop from feature/SlurmRemoteSupport
Conversation

@mhrtmnn (Contributor) commented Nov 5, 2020

This pull request extends TaPaSCo's SLURM support so that remote compute nodes can be used to carry out HLS and compose jobs.

The required architecture consists of three networked machines:

  • Host (front end):
    Runs a TaPaSCo instance that takes the user's CLI arguments and collects all files required for the selected job (e.g. kernel source files for HLS jobs or IPCores for compose jobs). These dependencies are copied over the network to a separate node referred to as the Workstation. The artefacts generated by a job (e.g. an IPCore for HLS, a bitstream for compose) are copied back to the Host once the job finishes (see the sketch after this list).

  • Workstation:
    In the simplest case, a network-attached storage node. It is required because, in general, we cannot push files directly to the SLURM compute node. Instead, the files are deposited in a known directory on this node, from which the SLURM compute node pulls them by itself.

  • SLURM node (back end):
    The login node for the compute node, with SLURM control tools such as sbatch and squeue installed. The compute node runs its own TaPaSCo instance.
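
A rough sketch of the resulting data flow (illustrative only: the host name, user name, and exchange directory below are hypothetical; the actual transfer logic is implemented in the Scala toolflow):

# Preamble: the Host pushes the job's dependencies to the Workstation
# (hypothetical host name and exchange directory).
scp -r ./job_files/ user@workstation:/net/shared/slurm_exchange/job_1/

# The SLURM compute node pulls these files from the Workstation by itself,
# runs the HLS or compose job, and deposits the artefacts back there.

# Postamble: the Host fetches the generated artefacts from the Workstation.
scp -r user@workstation:/net/shared/slurm_exchange/job_1/artefacts/ ./results/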

The above setup is configurable through a JSON config file. This PR contains an example file at toolflow/vivado/common/SLURM/ESA.json that describes an ESA-internal compute node. Different configurations can be selected via tapasco CLI options on the Host, for example --slurm ESA.
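
For illustration, such a config file might look roughly like the following. This is a hypothetical sketch only: the actual schema of ESA.json is not shown in this thread, and all field names and values below are made up.

{
  "_note": "hypothetical sketch, not the actual ESA.json schema",
  "workstation": {
    "host": "workstation.example.com",
    "exchange_dir": "/net/shared/slurm_exchange"
  },
  "slurm": {
    "login_node": "slurm-login.example.com",
    "workdir": "/scratch/SLURM/tapasco_workdir"
  }
}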

@mhrtmnn mhrtmnn requested a review from lukasmweber November 5, 2020 21:38
@mhrtmnn mhrtmnn linked an issue Nov 28, 2020 that may be closed by this pull request
@mhrtmnn mhrtmnn force-pushed the feature/SlurmRemoteSupport branch from f1e7781 to b7c0a20 on December 14, 2020 12:37
Previously, a job would be broken into its tasks, and a new tapasco
job would be created for each task. These jobs were then executed on the
SLURM cluster. Refactor this so that the original job is executed on
the SLURM cluster as-is, which simplifies the SLURM logic.
Since the SLURM cluster now processes whole jobs (instead of single
tasks), the dependencies (preamble) and produced artefacts (postamble)
of multiple platform/architecture pairs may need to be transferred.
@mhrtmnn (Contributor, Author) commented Dec 29, 2020

Executing TaPaSCo in SLURM mode, e.g. tapasco --slurm ESA hls arraysum -p pynq, assumes a working TaPaSCo installation on the SLURM node.

TaPaSCo can be installed via a SLURM job script like the following:

#!/bin/bash -xe
# Write job output to the shared Slurm directory.
# Note: -o and --output are the same sbatch option, so only one is used here.
#SBATCH -e /net/balin/Slurm/SLURM_stderr.txt
#SBATCH -o /net/balin/Slurm/SLURM_stdout.txt

# Clean install?
# rm -rf /scratch/SLURM/

echo "Check if installation exists"
if [ ! -d "/scratch/SLURM/tapasco" ]
then
	mkdir -p /scratch/SLURM
	cd /scratch/SLURM
	git clone https://github.com/esa-tu-darmstadt/tapasco.git --depth=1
fi

echo "Check if workdir exists"
if [ ! -d "/scratch/SLURM/tapasco_workdir" ]
then
	mkdir -p /scratch/SLURM/tapasco_workdir
	cd /scratch/SLURM/tapasco_workdir
	bash /scratch/SLURM/tapasco/tapasco-init.sh
	source tapasco-setup.sh

	echo "Building toolflow"
	cd ${TAPASCO_HOME_TOOLFLOW}/scala
	./gradlew --project-cache-dir=/tmp -g /tmp installDist
fi
echo "Checking installation"
cd /scratch/SLURM/tapasco_workdir
source tapasco-setup.sh
tapasco --help

Note: Building the toolflow via tapasco-build-toolflow does not work here, because gradlew complains about missing write permissions for the home directory; this is why the script above invokes gradlew directly with --project-cache-dir and -g pointing at /tmp.
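
For reference, a job script like the one above would be submitted from the SLURM login node along these lines (install_tapasco.sh is a hypothetical name for the script):

# Submit the installation script as a SLURM batch job.
sbatch install_tapasco.sh

# Watch the job in the queue until it completes.
squeue -u "$USER"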

@mhrtmnn mhrtmnn marked this pull request as draft February 18, 2021 14:39
Successfully merging this pull request may close these issues.

Investigate and Improve the TaPaSCo SLURM Support