The following conditions must be met when Vega is deployed in a local cluster:
- Ubuntu 18.04 or EulerOS 2.0 SP8
- CUDA 10.0 or CANN 20.1
- Python 3.7 or later
- PyTorch, TensorFlow (>1.14, <2.0), or MindSpore
Note: To deploy an Ascend 910 cluster, contact us.
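The Python requirement above can be checked on each host before anything is installed. The following is a minimal sketch, assuming a POSIX shell and GNU `sort` (for `-V` version ordering). The helper names `version_ge` and `check_python` are illustrative and are not part of Vega.

```shell
# Sketch: check a host against the Python requirement listed above.
# version_ge and check_python are hypothetical helper names.

# Succeeds if version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if the smaller of the pair is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds if python3 is on PATH and reports version 3.7 or later.
check_python() {
    command -v python3 >/dev/null 2>&1 || return 1
    py_ver="$(python3 -c 'import platform; print(platform.python_version())')" || return 1
    version_ge "$py_ver" "3.7"
}

if check_python; then
    echo "Python: OK"
else
    echo "Python: 3.7 or later is required"
fi
```

The same `version_ge` helper could be reused for other version checks, for example on the CUDA or framework versions, if you add the commands that report them in your environment.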
During cluster deployment, install Vega on each host:
pip3 install --user --upgrade noah-vega
In addition, install MPI on each host; see the MPI installation steps below.
After installing the preceding software on each host, configure SSH mutual trust and set up NFS.
Once these steps are complete, the cluster is deployed.
- Use apt to install MPI directly:
    sudo apt-get install mpi
- Run the following command to check that MPI is working:
    mpirun
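A slightly more informative check than running `mpirun` with no arguments is sketched below. It assumes an MPI implementation whose `mpirun` accepts `--version` and `-np` (Open MPI and MPICH both do, as far as we know); `mpi_available` is a hypothetical helper name. If apt cannot find a package named `mpi` on your release, `openmpi-bin` and `mpich` are commonly used packages, but which one suits your cluster is for you to verify.

```shell
# Sketch: confirm that an MPI launcher is installed and can start processes.
# mpi_available is a hypothetical helper name.

# Succeeds if the given command (default: mpirun) can be found on PATH.
mpi_available() {
    command -v "${1:-mpirun}" >/dev/null 2>&1
}

if mpi_available mpirun; then
    # Print the MPI implementation and version.
    mpirun --version || echo "could not query the MPI version"
    # Start two local processes; each one prints the host name.
    mpirun -np 2 hostname || echo "mpirun could not start local processes"
else
    echo "mpirun not found; install an MPI implementation first"
fi
```

Running across several hosts additionally requires the SSH mutual trust described next, and the host-list option differs between MPI implementations.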
Any two hosts on the network must support SSH mutual trust. The configuration method is as follows:
- Install the SSH server.
    sudo apt-get install openssh-server
- Generate a key pair.
    ssh-keygen -t rsa
  Two files, id_rsa and id_rsa.pub, are created in ~/.ssh/; id_rsa.pub is the public key.
- Check whether the authorized_keys file exists in ~/.ssh/. If it does not, create it and run the chmod 600 ~/.ssh/authorized_keys command to set its permissions.
- Append the public key id_rsa.pub to the authorized_keys file on each of the other servers.
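The key steps above can be scripted so that repeating them does not create duplicate entries. The following is a minimal sketch of the local part only; `append_key_once` is a hypothetical helper name, and the remote copy itself is left as a commented example.

```shell
# Sketch: add a public key to an authorized_keys file without creating duplicates.
# append_key_once is a hypothetical helper name.

# Usage: append_key_once <public key file> <authorized_keys file>
append_key_once() {
    pub_key_file="$1"
    auth_file="$2"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"    # same permission as in the step above
    key="$(cat "$pub_key_file")"
    # -x: match whole lines, -F: treat the key as a fixed string, -q: no output
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Example (not run here): trust this host's own key, then test a remote login.
#   append_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh -o BatchMode=yes <user name>@<host> true && echo "passwordless login works"
```

For the remote append, the `ssh-copy-id <user name>@<host>` tool shipped with OpenSSH does the same job over the network. With `BatchMode=yes`, `ssh` fails instead of prompting for a password, which makes the last command a convenient test of mutual trust.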
NFS is a widely used system for data sharing in a cluster. If an NFS system already exists in the cluster, use the existing NFS system.
The following instructions for configuring NFS may not apply to all NFS systems. Adjust the instructions based on the actual cluster environment.
Before configuring the NFS server, check whether the UID of the current user is the same on every host in the cluster. If the UIDs differ, the NFS shared directory cannot be accessed. In that case, change the user's UID to the same value on every host, choosing a value that does not conflict with the UIDs of other users.
To query the UID of the current user, run the following command:
id <user name>
To change the UID (do this with caution, and contact the cluster system administrator for help):
sudo usermod -u <new UID> <user name>
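Comparing UIDs by eye becomes error-prone with more than a few hosts. The sketch below shows one way to automate the comparison; `uids_match` is a hypothetical helper that reads one `<host> <uid>` pair per line, and the collection loop assumes SSH mutual trust is already configured.

```shell
# Sketch: decide whether one user has the same UID on every host.
# uids_match is a hypothetical helper; it reads "<host> <uid>" lines on standard input.

uids_match() {
    # Count the distinct values in the second column; exactly one means all hosts agree.
    [ "$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')" = "1" ]
}

# Example (not run here):
#   for host in <host 1> <host 2>; do
#       echo "$host $(ssh "$host" id -u <user name>)"
#   done | uids_match && echo "UIDs match"
```

Here `id -u <user name>` prints only the numeric UID, which is easier to compare than the full output of `id <user name>`.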
NFS server settings:
- Install the NFS server.
    sudo apt install nfs-kernel-server
- Create a shared directory on the NFS server, for example, /<user home path>/nfs_cache.
    cd ~
    mkdir nfs_cache
- Add the shared directory to the configuration file /etc/exports.
    sudo bash -c "echo '/<user home path>/nfs_cache *(rw,sync,no_subtree_check,no_root_squash,all_squash)' >> /etc/exports"
- Restart the NFS server.
    sudo service nfs-kernel-server restart
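Because the export step appends to /etc/exports, running it twice leaves a duplicate line. The following sketch guards against that; `add_export_once` is a hypothetical helper name, the export options are the ones used above, and the directory path is assumed to contain no spaces.

```shell
# Sketch: add a directory to an exports file only if it is not listed yet.
# add_export_once is a hypothetical helper; the options are the ones used above.

# Usage: add_export_once <shared directory> <exports file>
add_export_once() {
    share_dir="$1"
    exports_file="$2"
    opts='*(rw,sync,no_subtree_check,no_root_squash,all_squash)'
    # awk exits 0 only if some line has the directory as its first field.
    if ! awk -v d="$share_dir" '$1 == d { found = 1 } END { exit !found }' "$exports_file" 2>/dev/null; then
        printf '%s %s\n' "$share_dir" "$opts" >> "$exports_file"
    fi
}

# Example (not run here; writing /etc/exports needs root):
#   add_export_once /<user home path>/nfs_cache /etc/exports
#   sudo exportfs -ra        # re-read /etc/exports
#   showmount -e localhost   # list the directories currently exported
```

`exportfs -ra` and `showmount -e` are standard NFS utilities; they offer a way to confirm that the share is visible after the restart.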
The NFS client must be configured on each server.
- Install the client tool.
    sudo apt install nfs-common
- Create a local mount directory.
    cd ~
    mkdir -p ./nfs_folder
- Mount the shared directory.
    sudo mount -t nfs <server IP>:/<user home path>/nfs_cache /<user home path>/nfs_folder
After mounting completes, /<user home path>/nfs_folder is the working directory of the multi-node cluster. Run the Vega program in this directory.
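A mount made with the `mount` command does not survive a reboot. If you want it to persist, one common option, which is not part of the steps above, is an /etc/fstab entry. The sketch below only builds such a line; `fstab_entry` is a hypothetical helper name, the `defaults` mount options are an assumption, and the address and paths are placeholders.

```shell
# Sketch: build an /etc/fstab line so the NFS share is mounted again after a reboot.
# fstab_entry is a hypothetical helper name.

# Usage: fstab_entry <server IP> <remote directory> <local mount directory>
fstab_entry() {
    printf '%s:%s %s nfs defaults 0 0\n' "$1" "$2" "$3"
}

# Example with placeholder values (192.0.2.10 is a documentation-only address):
fstab_entry 192.0.2.10 /home/alice/nfs_cache /home/alice/nfs_folder

# To confirm the mount on a client (not run here):
#   mountpoint -q /<user home path>/nfs_folder && echo "mounted"
```

Review the generated line with your cluster administrator before adding it to /etc/fstab, since a wrong entry can delay or block booting.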