The following requirements must be met when Vega is deployed in a local cluster:
- Ubuntu 18.04 or later
- CUDA 10.0
- Python 3.7
- pip
Note: To deploy Vega on an Ascend 910 cluster, contact us.
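The requirements above can be checked on each node before installation. The following is a minimal sketch and not part of Vega: the helper names `version_ge` and `report` are hypothetical, it assumes GNU `sort -V` is available, and it assumes that `lsb_release` and `nvcc` are on the PATH wherever the OS release and the CUDA toolkit can be detected.

```shell
#!/bin/sh
# Pre-flight check for the requirements listed above (illustrative sketch).

# version_ge A B: succeed when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME STATUS: print one aligned line per check, never abort.
report() {
    printf '%-12s %s\n' "$1" "$2"
}

# Ubuntu 18.04 or later.
if command -v lsb_release >/dev/null 2>&1; then
    release="$(lsb_release -rs)"
    if version_ge "$release" 18.04; then
        report "OS release" "ok ($release)"
    else
        report "OS release" "older than 18.04 ($release)"
    fi
else
    report "OS release" "unknown (lsb_release not found)"
fi

# CUDA: only the presence of the toolkit compiler is checked here.
if command -v nvcc >/dev/null 2>&1; then
    report "CUDA" "$(nvcc --version | tail -n 1)"
else
    report "CUDA" "nvcc not found on PATH"
fi

# Python and pip: print whatever version is installed.
if command -v python3 >/dev/null 2>&1; then
    report "Python" "$(python3 --version 2>&1)"
else
    report "Python" "python3 not found"
fi
if command -v pip3 >/dev/null 2>&1; then
    report "pip" "$(pip3 --version | cut -d' ' -f1-2)"
else
    report "pip" "pip3 not found"
fi
```

The script only reports what it finds; comparing the printed CUDA and Python versions against the list above is left to the reader.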
During cluster deployment, install Vega and the required software packages on each cluster node by running the following command:
pip3 install --user --upgrade noah-vega
In addition, you need to install MPI. For details, see Installing MPI.
After installing the preceding software on each host, configure SSH mutual trust and set up NFS.
After the preceding operations are complete, the cluster is deployed.
After the cluster is deployed, run the following command to check whether the cluster is available:
vega-verify-cluster -m <master IP> -s <slave IP 1> <slave IP 2> ... -n <shared NFS folder>
For example:
vega-verify-cluster -m 192.168.0.2 -s 192.168.0.3 192.168.0.4 -n /home/alan/nfs_folder
If the verification succeeds, the message "All cluster check items have passed." is displayed. If an error occurs during verification, adjust the cluster configuration based on the exception information.
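For scripted deployments, the success message can be detected automatically. The sketch below assumes that `vega-verify-cluster` writes the message quoted above as plain text to standard output or standard error; the helper name `cluster_ok` and the log file name `verify.log` are hypothetical.

```shell
#!/bin/sh
# Success message as quoted in the text above.
SUCCESS_MSG='All cluster check items have passed.'

# cluster_ok: read verification output on stdin and succeed
# when the success message appears anywhere in it.
cluster_ok() {
    grep -qF "$SUCCESS_MSG"
}

# Usage (addresses and path are the example values from this document):
#   vega-verify-cluster -m 192.168.0.2 -s 192.168.0.3 192.168.0.4 \
#       -n /home/alan/nfs_folder 2>&1 | tee verify.log | cluster_ok \
#       && echo "cluster ready" || echo "verification failed, see verify.log"
```

Keeping the full output in a log file preserves the exception information needed to adjust the cluster when the check fails.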
Installing MPI:
- Use the apt tool to install MPI directly.
    sudo apt-get install mpi
- Run the following command to check that MPI is working.
    mpirun
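Running `mpirun` without arguments only shows whether the launcher is installed. A slightly stronger check starts a trivial program under MPI. The following is a sketch under assumptions: the function name `check_mpi` is hypothetical, and the `--version` flag is the one accepted by Open MPI (other MPI implementations may differ).

```shell
#!/bin/sh
# check_mpi: succeed when mpirun is installed and can start two local processes.
check_mpi() {
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "mpirun not found on PATH" >&2
        return 1
    fi
    # Print the launcher version (Open MPI flag; an assumption for other MPIs).
    mpirun --version
    # Launch two copies of `hostname` on the local machine;
    # the function's exit status is the status of this launch.
    mpirun -np 2 hostname
}

check_mpi || echo "MPI check failed" >&2
```

On a working installation the local hostname is printed twice, once per launched process.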
SSH mutual trust must be configured between every pair of hosts in the cluster. Configure it as follows:
- Install SSH.
    sudo apt-get install openssh-server
- Generate the key pair.
    ssh-keygen -t rsa
  Two files, id_rsa and id_rsa.pub, are created in the ~/.ssh/ directory; id_rsa.pub is the public key.
- Check for the authorized_keys file in the ~/.ssh/ directory. If the file does not exist, create it and run the chmod 600 ~/.ssh/authorized_keys command to change its permissions.
- Append the contents of the public key id_rsa.pub to the authorized_keys file on the other servers.
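The last two steps can be scripted so that repeated runs do not add duplicate entries. The following is a minimal sketch; the function name `authorize_key` and the `/path/to/id_rsa.pub` placeholder are hypothetical, and the sketch assumes the public key file holds a single line, as produced by `ssh-keygen`.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE AUTH_FILE
# Append the public key to AUTH_FILE unless an identical line is already there.
authorize_key() {
    pubkey_file="$1"
    auth_file="$2"
    ssh_dir="$(dirname "$auth_file")"

    # With strict permission checks enabled, sshd may refuse keys kept in
    # a directory or file that other users can write to.
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    touch "$auth_file" && chmod 600 "$auth_file"

    key="$(cat "$pubkey_file")" || return 1
    # -x: match whole lines only, -F: treat the key as a fixed string.
    if ! grep -qxF -e "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Usage, on the host that should accept the key (after copying id_rsa.pub there):
#   authorize_key /path/to/id_rsa.pub "$HOME/.ssh/authorized_keys"
```

Where password login is still enabled, `ssh-copy-id -i ~/.ssh/id_rsa.pub <user>@<host>` performs the same append over SSH, and `ssh -o BatchMode=yes <user>@<host> hostname` then confirms that no password prompt remains.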
NFS server settings:
- Install the NFS server.
    sudo apt install nfs-kernel-server
- Create a shared directory on the NFS server, for example, /<user home path>/nfs_cache.
    cd ~
    mkdir nfs_cache
- Write the shared directory to the configuration file /etc/exports.
    sudo bash -c "echo '/<user home path>/nfs_cache *(rw,sync,no_subtree_check,no_root_squash,all_squash)' >> /etc/exports"
- Change the owner of the shared directory to the nobody user.
    sudo chown -R nobody: /<user home path>/nfs_cache
- Restart the NFS server.
    sudo service nfs-kernel-server restart
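Because the export entry is appended with `>>`, running that step twice leaves a duplicate line in /etc/exports. The following sketch shows one way to guard against that. The helper names `export_line` and `has_export` are hypothetical; the export options are copied from the step above.

```shell
#!/bin/sh
# Export options copied from the /etc/exports step above.
EXPORT_OPTS='rw,sync,no_subtree_check,no_root_squash,all_squash'

# export_line DIR: print the /etc/exports entry for DIR.
export_line() {
    printf '%s *(%s)\n' "$1" "$EXPORT_OPTS"
}

# has_export DIR FILE: succeed when FILE already contains an entry
# whose first field is exactly DIR.
has_export() {
    awk -v dir="$1" '$1 == dir { found = 1 } END { exit !found }' "$2"
}

# Usage (substitute the shared directory created above):
#   dir="/<user home path>/nfs_cache"
#   has_export "$dir" /etc/exports || export_line "$dir" | sudo tee -a /etc/exports
#   sudo exportfs -ra            # re-read /etc/exports
#   showmount -e localhost       # list what the server now exports
```

Piping through `sudo tee -a` keeps the root privilege limited to the single write, and `showmount -e` gives a quick confirmation that the directory is visible to clients.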
The NFS client must be configured on each server.
- Install the client tool.
    sudo apt install nfs-common
- Create a local mount directory.
    cd ~
    mkdir -p ./nfs_folder
- Mount the shared directory.
    sudo mount -t nfs <server IP>:/<user home path>/nfs_cache /<user home path>/nfs_folder
After the mounting is complete, /<user home path>/nfs_folder is the working directory of the multi-node cluster. Run the Vega program in this directory.
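Before running Vega, it can help to confirm on each client that the working directory really is an NFS mount and not an empty local folder. A minimal sketch, assuming `findmnt` from util-linux is installed; the function name `is_nfs_mount` is hypothetical.

```shell
#!/bin/sh
# is_nfs_mount DIR: succeed when DIR is a mount point whose
# filesystem type starts with "nfs" (for example nfs or nfs4).
is_nfs_mount() {
    fstype="$(findmnt -n -o FSTYPE --mountpoint "$1" 2>/dev/null)" || return 1
    case "$fstype" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Usage (substitute the mount directory created above):
#   is_nfs_mount "$HOME/nfs_folder" && echo "NFS share is mounted"
```

As a further check, create a file in the directory on one node and list the directory on another node; the file should appear on both.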