From fc4d0ac088980b5d63bba6c3b0439d851c87af74 Mon Sep 17 00:00:00 2001
From: dharamveergit
Date: Thu, 8 Feb 2024 15:18:51 +0530
Subject: [PATCH 1/4] Add CNAME files for akash.network

---
 CNAME      | 1 +
 docs/CNAME | 1 +
 2 files changed, 2 insertions(+)
 create mode 100644 CNAME
 create mode 100644 docs/CNAME

diff --git a/CNAME b/CNAME
new file mode 100644
index 00000000..09e41266
--- /dev/null
+++ b/CNAME
@@ -0,0 +1 @@
+akash.network
\ No newline at end of file
diff --git a/docs/CNAME b/docs/CNAME
new file mode 100644
index 00000000..09e41266
--- /dev/null
+++ b/docs/CNAME
@@ -0,0 +1 @@
+akash.network
\ No newline at end of file

From 1ac887914957a6bfa2b4b136da08c2b59cd754fa Mon Sep 17 00:00:00 2001
From: dharamveergit
Date: Thu, 8 Feb 2024 16:35:56 +0530
Subject: [PATCH 2/4] Update documentation links

---
 src/content/Docs/deployments/apps-on-akash/index.md | 2 +-
 src/content/Docs/deployments/overview/index.md | 2 +-
 src/content/Docs/guides/helium-validator/index.md | 2 +-
 src/content/Docs/guides/mine-raptoreum-on-akash/index.md | 4 ++--
 src/content/Docs/guides/polygon-on-akash/index.md | 2 +-
 .../guides/tls-termination-of-akash-deployment/index.md | 2 +-
 src/content/Docs/network-features/fractional-uakt/index.md | 2 +-
 .../Docs/other-resources/akash-mainnet8-upgrade/index.md | 2 +-
 .../other-resources/experimental/streamlined-steps/index.md | 2 +-
 src/content/Docs/other-resources/security/index.md | 2 +-
 .../build-a-cloud-provider/akash-provider-checkup/index.md | 2 +-
 .../index.md | 6 +++---
 .../ip-leases-provider-enablement/index.md | 2 +-
 13 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/src/content/Docs/deployments/apps-on-akash/index.md b/src/content/Docs/deployments/apps-on-akash/index.md
index 6df6fdd4..532ee05c 100644
--- a/src/content/Docs/deployments/apps-on-akash/index.md
+++ b/src/content/Docs/deployments/apps-on-akash/index.md
@@ -10,7 +10,7 @@ Awesome Akash is a curated list of awesome resources people can use to familiari
 
 **Repository**: [akash-network/awesome-akash](https://github.com/akash-network/awesome-akash)
 
-**Instructions:** [how to deploy](https://akash.network/docs/guides/deploy) the SDL files in this repository
+**Instructions:** [how to deploy](/docs/deployments/cloudmos-deploy/) the SDL files in this repository
 
 Join our [discord](https://discord.akash.network) if you have questions or concerns. Our team is always eager to hear from you. Also, follow [@akashnet\_](https://twitter.com/akashnet_) to stay in the loop with updates and announcements.
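The hunk above only swaps the deploy link, but readers landing here may not know what "the SDL files in this repository" look like. As a hedged illustration only (the service name, image tag, and pricing below are invented for the sketch and are not taken from the Awesome Akash repository), a minimal SDL has this shape:

```yaml
version: "2.0"

services:
  web:
    image: nginx:1.25 # pin a tag; the overview doc below warns against :latest
    expose:
      - port: 80
        as: 80
        to:
          - global: true

profiles:
  compute:
    web:
      resources:
        cpu:
          units: 0.5
        memory:
          size: 512Mi
        storage:
          size: 512Mi
  placement:
    akash:
      pricing:
        web:
          denom: uakt
          amount: 1000

deployment:
  web:
    akash:
      profile: web
      count: 1
```

Every SDL in the repository follows this services/profiles/deployment layout; the deploy guide linked in the hunk walks through submitting it.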
diff --git a/src/content/Docs/deployments/overview/index.md b/src/content/Docs/deployments/overview/index.md
index 1b3f8808..157fb1e2 100644
--- a/src/content/Docs/deployments/overview/index.md
+++ b/src/content/Docs/deployments/overview/index.md
@@ -18,4 +18,4 @@ Applications can be deployed onto the Akash network using a platform that best s
 ### Image Declaration
 
 - Avoid using `:latest` image tags as Akash Providers heavily cache images
-- Additional info on related SDL declarations is available [here](https://akash.network/docs/readme/stack-definition-language#services)
+- Additional info on related SDL declarations is available [here](/docs/getting-started/stack-definition-language/)
diff --git a/src/content/Docs/guides/helium-validator/index.md b/src/content/Docs/guides/helium-validator/index.md
index b093a9f2..dea79e91 100644
--- a/src/content/Docs/guides/helium-validator/index.md
+++ b/src/content/Docs/guides/helium-validator/index.md
@@ -34,7 +34,7 @@ You can deploy the validator on Akash using the example deploy.yml. Note that to
 
 Either clone this repository or create a `deploy.yml` file. Enter your S3 bucket and IAM credentials into the `env` section. If you have a swarm_key already, make sure this is uploaded to S3 in the same location as S3_KEY_PATH.
 
-Deploy as [per the docs](https://akash.network/docs/guides/deploy) or using a [deploy UI](https://github.com/tombeynon/akash-deploy).
+Deploy as [per the docs](/docs/deployments/cloudmos-deploy/) or using a [deploy UI](https://github.com/tombeynon/akash-deploy).
 
 Once the container is deployed, check the logs to see your address once the server starts (can take a while). If your swarm_key didn't exist in S3 before, the new one should have been uploaded. Subsequent deploys using the same S3 details will now use the same swarm_key.
 
diff --git a/src/content/Docs/guides/mine-raptoreum-on-akash/index.md b/src/content/Docs/guides/mine-raptoreum-on-akash/index.md
index e85d22e4..3aaa4de9 100644
--- a/src/content/Docs/guides/mine-raptoreum-on-akash/index.md
+++ b/src/content/Docs/guides/mine-raptoreum-on-akash/index.md
@@ -19,7 +19,7 @@ Welcome [**Raptoreum**](https://raptoreum.com) \*\*\*\* miners! [**Akash**](http
 2. Install [**Cloudmos Deploy**](https://cloudmos.io/cloud-deploy) \*\*\*\* and import your AKT wallet address from Keplr
 3. [**Fund your wallet**](https://github.com/akash-network/awesome-akash/blob/raptoreum/raptoreum-miner/README.md#Quickest-way-to-get-more-AKT)
 
-For additional help we recommend you [**follow our full deployment guide**](https://akash.network/docs/guides/deploy) \*\*\*\* in parallel with this guide.
+For additional help we recommend you [**follow our full deployment guide**](/docs/deployments/cloudmos-deploy/) \*\*\*\* in parallel with this guide.
 
 ## How does this work?
 
@@ -27,7 +27,7 @@ Akash uses its blockchain to manage your container deployment and accounting. To
 
 ## Default wallet
 
-Akash uses [**Keplr**](https://chrome.google.com/webstore/detail/keplr/dmkamcknogkgcdfhhbddcghachkejeap?hl=en) as the desktop wallet. Advanced users can follow the \*\*\*\* [**CLI wallet instructions**](https://akash.network/docs/guides/cli).
+Akash uses [**Keplr**](https://chrome.google.com/webstore/detail/keplr/dmkamcknogkgcdfhhbddcghachkejeap?hl=en) as the desktop wallet. Advanced users can follow the \*\*\*\* [**CLI wallet instructions**](/docs/deployments/akash-cli/installation/).
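The raptoreum hunk above points advanced users at the CLI wallet instructions. A hedged sketch of what that path looks like, assuming the `provider-services` binary from the linked installation guide exposes the standard Cosmos SDK key subcommands (verify against that guide before relying on it):

```bash
# Create a key and print its akash1... address; back up the mnemonic it prints.
provider-services keys add my-wallet
provider-services keys show my-wallet -a
```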
 
 ## Quickest way to get more AKT
 
diff --git a/src/content/Docs/guides/polygon-on-akash/index.md b/src/content/Docs/guides/polygon-on-akash/index.md
index 9f34802f..7c1bd5e7 100644
--- a/src/content/Docs/guides/polygon-on-akash/index.md
+++ b/src/content/Docs/guides/polygon-on-akash/index.md
@@ -30,7 +30,7 @@ If you want a deeper understanding of Polygon and Akash, see these architecture
 
 ## Akash Application Tools
 
-This [_**guide**_](https://akash.network/docs/guides/deploy) _\*\*\*\*_ provides step by step instructions on how to deploy an app on Akash using a desktop tool named Cloudmos Deploy.
+This [_**guide**_](/docs/deployments/cloudmos-deploy/) _\*\*\*\*_ provides step by step instructions on how to deploy an app on Akash using a desktop tool named Cloudmos Deploy.
 
 ## Technical Support Channels
 
diff --git a/src/content/Docs/guides/tls-termination-of-akash-deployment/index.md b/src/content/Docs/guides/tls-termination-of-akash-deployment/index.md
index 63164241..a356b769 100644
--- a/src/content/Docs/guides/tls-termination-of-akash-deployment/index.md
+++ b/src/content/Docs/guides/tls-termination-of-akash-deployment/index.md
@@ -40,7 +40,7 @@ Make sure to specify the hostname you control, in this example it is “ghost.ak
 
 When you deploy with 80/tcp port exposed in Akash, the nginx-ingress-controller on the provider will automatically get 443/tcp exposed too. This makes Full TLS termination possible.
 
-If you are not familiar with Akash deployments, visit the documentation for the desktop app [Cloudmos Deploy](https://akash.network/docs/guides/deploy) as an easy way to get started.
+If you are not familiar with Akash deployments, visit the documentation for the desktop app [Cloudmos Deploy](/docs/deployments/cloudmos-deploy/) as an easy way to get started.
 
 ```
 ---
diff --git a/src/content/Docs/network-features/fractional-uakt/index.md b/src/content/Docs/network-features/fractional-uakt/index.md
index 2278cfbc..82b06942 100644
--- a/src/content/Docs/network-features/fractional-uakt/index.md
+++ b/src/content/Docs/network-features/fractional-uakt/index.md
@@ -16,7 +16,7 @@ In this guide we will use the Cloudmos Deploy tool to launch deployments using f
 
 For the purpose of demonstrating the use of fractional uAKT we will utilize the popular Hello World web application and SDL that can be found in the [Awesome Akash repository](https://github.com/akash-network/awesome-akash). The example SDL file will be modified to take advantage of the new fractional uAKT option.
 
-The [Cloudmos Deploy](https://akash.network/docs/guides/deploy) application will be used to launch the deployment.
+The [Cloudmos Deploy](/docs/deployments/cloudmos-deploy/) application will be used to launch the deployment.
 
 ### **Example Fractional uAKT Use in Cloudmos Deploy**
 
diff --git a/src/content/Docs/other-resources/akash-mainnet8-upgrade/index.md b/src/content/Docs/other-resources/akash-mainnet8-upgrade/index.md
index 3fc47eee..13e2f3b9 100644
--- a/src/content/Docs/other-resources/akash-mainnet8-upgrade/index.md
+++ b/src/content/Docs/other-resources/akash-mainnet8-upgrade/index.md
@@ -294,7 +294,7 @@ helm -n akash-services get values akash-provider | grep -v '^USER-SUPPLIED VALUE
 helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml
 ```
 
-> _**IMPORTANT**_: Make sure your provider is using the latest bid price script! Here is the guide that tells you how you can set it for your akash-provider chart. [https://akash.network/docs/providers/build-a-cloud-provider/akash-cloud-provider-build-with-helm-charts/#step-8---provider-bid-customization](/docs/providers/build-a-cloud-provider/akash-cloud-provider-build-with-helm-charts/#step-8---provider-bid-customization)
+> _**IMPORTANT**_: Make sure your provider is using the latest bid price script! Here is the guide that tells you how you can set it for your akash-provider chart. [/docs/providers/build-a-cloud-provider/akash-cloud-provider-build-with-helm-charts/#step-8---provider-bid-customization](/docs/providers/build-a-cloud-provider/akash-cloud-provider-build-with-helm-charts#step-8---provider-bid-customization)
 
 ##### 2.4 akash-hostname-operator Chart
 
diff --git a/src/content/Docs/other-resources/experimental/streamlined-steps/index.md b/src/content/Docs/other-resources/experimental/streamlined-steps/index.md
index 6c5f2efb..2ae1d664 100644
--- a/src/content/Docs/other-resources/experimental/streamlined-steps/index.md
+++ b/src/content/Docs/other-resources/experimental/streamlined-steps/index.md
@@ -154,7 +154,7 @@ source env.sh
 
 The steps in this section should be followed if you have a pre-existing Akash account that needs to be imported.\
 
-If you do not have an Akash account and need to create one, follow the steps in this[ guide](https://akash.network/docs/token/keplr) and then proceed with the step below.
+If you do not have an Akash account and need to create one, follow the steps in this[ guide](/docs/getting-started/token-and-wallets#keplr-wallet) and then proceed with the step below.
 
 ### **Import Pre-Existing Account**
 
diff --git a/src/content/Docs/other-resources/security/index.md b/src/content/Docs/other-resources/security/index.md
index 0b146eb2..21a42cc3 100644
--- a/src/content/Docs/other-resources/security/index.md
+++ b/src/content/Docs/other-resources/security/index.md
@@ -18,7 +18,7 @@ Default certificate lifespan is 365 days from the moment of issuance. This can b
 
 ### **How do I limit my trust to Audited Providers?**
 
-Follow the getting started guide, and you will see the [instructions for audited attributes](https://akash.network/docs/guides/deploy#audited-attributes) suggest using only servers **"signed by"** Akash Network. If you deploy today, you will see bids by Equinix servers that audited and signed by Akash Network. By doing this you are trusting [Equinix’s Security Standards and Compliance](https://www.equinix.com/data-centers/design/standards-compliance) and you are trusting Overclock Labs as the auditor to only sign servers that meet those standards.
+Follow the getting started guide, and you will see the [instructions for audited attributes](/docs/providers/akash-audites-atributes) suggest using only servers **"signed by"** Akash Network. If you deploy today, you will see bids by Equinix servers that audited and signed by Akash Network. By doing this you are trusting [Equinix’s Security Standards and Compliance](https://www.equinix.com/data-centers/design/standards-compliance) and you are trusting Overclock Labs as the auditor to only sign servers that meet those standards.
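The security answer above leans on the "signed by" mechanism. As a hedged sketch of how a deployer expresses that restriction in an SDL placement block (structure only — the auditor address is a placeholder, not an official value; take the real one from the audited-attributes page linked above):

```yaml
profiles:
  placement:
    akash:
      attributes:
        host: akash
      signedBy:
        anyOf:
          - "akash1<auditor-address>" # placeholder: only providers whose attributes this auditor signed may bid
      pricing:
        web:
          denom: uakt
          amount: 1000
```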
 
 ### **What are Audited Attributes?**
 
diff --git a/src/content/Docs/providers/build-a-cloud-provider/akash-provider-checkup/index.md b/src/content/Docs/providers/build-a-cloud-provider/akash-provider-checkup/index.md
index d6fc92a6..1f5c0e71 100644
--- a/src/content/Docs/providers/build-a-cloud-provider/akash-provider-checkup/index.md
+++ b/src/content/Docs/providers/build-a-cloud-provider/akash-provider-checkup/index.md
@@ -61,7 +61,7 @@ Launch the Cloudmos Deploy application to complete the sections that follow.
 ![](../../../assets/akashlyticsCreateDeployment.png)
 
 - In our testing we will use the Hello Akash World simple SDL
-- Note - this SDL does not specify any attributes. If the list of bids received from the deployment is large and you would like to reduce the list to isolate a bid from your provider a bit easier, consider attribute use as detailed in this [SDL reference](https://akash.network/docs/providers/akash-audited-attributes#attribute-location-within-the-sdl).
+- Note - this SDL does not specify any attributes. If the list of bids received from the deployment is large and you would like to reduce the list to isolate a bid from your provider a bit easier, consider attribute use as detailed in this [SDL reference](/docs/providers/akash-audites-atributes#auditor-location-within-the-sdl).
 - Otherwise process with the deployment with no need for change to the Hello Akash World SDL and pause when you reach reach the Create Lease phase of the deployment
 
 ![](../../../assets/akashlyticsHelloWorldSelect.png)
diff --git a/src/content/Docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/index.md b/src/content/Docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/index.md
index 3a566192..5fbadb07 100644
--- a/src/content/Docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/index.md
+++ b/src/content/Docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/index.md
@@ -272,7 +272,7 @@ helm install --create-namespace -n rook-ceph rook-ceph rook-release/rook-ceph --
 **TESTING / ALL-IN-ONE**
 
 > - Update `deviceFilter` to match your disks
-> - Change storageClass name from `beta3` to one you are planning to use based on this [table](https://akash.network/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/storage-class-types)
+> - Change storageClass name from `beta3` to one you are planning to use based on this [table](/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/#storage-class-types)
 > - Add your nodes you want the Ceph storage to use the disks on under the `nodes` section; (make sure to change `node1`, `node2`, ... to your K8s node names!
 >
 > When planning all-in-one production provider (or a single storage node) with multiple storage drives (minimum 3):
@@ -355,8 +355,8 @@ EOF
 **PRODUCTION**
 
 > - Update `deviceFilter` to match your disks
-> - Change storageClass name from `beta3` to one you are planning to use based on this [table](https://akash.network/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/storage-class-types)
-> - Update `osdsPerDevice` based on this [table](https://akash.network/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/storage-class-types)
+> - Change storageClass name from `beta3` to one you are planning to use based on this [table](/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/#storage-class-types)
+> - Update `osdsPerDevice` based on this [table](/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/#storage-class-types)
 > - Add your nodes you want the Ceph storage to use the disks on under the `nodes` section; (make sure to change `node1`, `node2`, ... to your K8s node names!
 > - When planning a single storage node with multiple storage drives (minimum 3):
 >   - Change `failureDomain` to `osd`
diff --git a/src/content/Docs/providers/build-a-cloud-provider/ip-leases-provider-enablement/index.md b/src/content/Docs/providers/build-a-cloud-provider/ip-leases-provider-enablement/index.md
index 0a4bf560..f9fdd616 100644
--- a/src/content/Docs/providers/build-a-cloud-provider/ip-leases-provider-enablement/index.md
+++ b/src/content/Docs/providers/build-a-cloud-provider/ip-leases-provider-enablement/index.md
@@ -116,7 +116,7 @@ kubectl apply -f metallb-config.yaml
 
 Based on MetalLB via Kubespray guidance documented [here](https://github.com/kubernetes-sigs/kubespray/blob/v2.20.0/docs/metallb.md)
 
-The Kubespray flags provided bellow should go into your Provider's Kubespray inventory file and under the vars section. Our reference Provider Kubespray inventory file - used during initial Provider Kubernetes cluster build - is located [here](https://akash.network/docs/providers/build-a-cloud-provider/kubernetes-cluster-for-akash-providers/step-4-ansible-inventory#manual-edits-insertions-of-the-hosts.yaml-inventory-file).
+The Kubespray flags provided bellow should go into your Provider's Kubespray inventory file and under the vars section. Our reference Provider Kubespray inventory file - used during initial Provider Kubernetes cluster build - is located [here](/docs/providers/build-a-cloud-provider/kubernetes-cluster-for-akash-providers/kubernetes-cluster-for-akash-providers#step-4---ansible-inventory).
 
 ```
 # akash provider needs metallb pool name set to `default` - https://github.com/akash-network/provider/blob/v0.1.0-rc13/cluster/kube/metallb/client.go#L43

From a03d845dd70919e47a4f56b6096fd1dd12c98a7c Mon Sep 17 00:00:00 2001
From: dharamveergit
Date: Thu, 8 Feb 2024 16:41:17 +0530
Subject: [PATCH 3/4] fix

---
 docs/CNAME | 1 -
 1 file changed, 1 deletion(-)
 delete mode 100644 docs/CNAME

diff --git a/docs/CNAME b/docs/CNAME
deleted file mode 100644
index 09e41266..00000000
--- a/docs/CNAME
+++ /dev/null
@@ -1 +0,0 @@
-akash.network
\ No newline at end of file

From a8c68eb58725f7e3e710a61e5c58b72997cca390 Mon Sep 17 00:00:00 2001
From: dharamveergit
Date: Thu, 8 Feb 2024 19:39:25 +0530
Subject: [PATCH 4/4] fix: docs new doc, Update favicon and provider FAQ

---
 public/favicon.svg | 11 +
 src/components/base-head.astro | 2 +-
 src/content/Docs/_sequence.ts | 1 +
 .../experimental/amd-gpu-support/index.md | 264 ++++++++++++++++++
 .../providers/provider-faq-and-guide/index.md | 2 +-
 5 files changed, 278 insertions(+), 2 deletions(-)
 create mode 100644 public/favicon.svg
 create mode 100644 src/content/Docs/other-resources/experimental/amd-gpu-support/index.md

diff --git a/public/favicon.svg b/public/favicon.svg
new file mode 100644
index 00000000..1617b2c4
--- /dev/null
+++ b/public/favicon.svg
@@ -0,0 +1,11 @@
+
+
+
+
+
+
+
+
+
+
+
diff --git a/src/components/base-head.astro b/src/components/base-head.astro
index 21f8dd5c..45080520 100644
--- a/src/components/base-head.astro
+++ b/src/components/base-head.astro
@@ -13,7 +13,7 @@ const { title, description, image = "/meta-images/home.png" } = Astro.props;
-
+
diff --git a/src/content/Docs/_sequence.ts b/src/content/Docs/_sequence.ts
index babaadee..be73a3cf 100644
--- a/src/content/Docs/_sequence.ts
+++ b/src/content/Docs/_sequence.ts
@@ -257,6 +257,7 @@ export const docsSequence = [
       {
         label: "Experimental",
         subItems: [
+          { label: "AMD GPU Support" },
           { label: "Akash Provider Streamlined Build with Rancher K3s" },
           {
             label: "Omnibus",
diff --git a/src/content/Docs/other-resources/experimental/amd-gpu-support/index.md b/src/content/Docs/other-resources/experimental/amd-gpu-support/index.md
new file mode 100644
index 00000000..76cccfde
--- /dev/null
+++ b/src/content/Docs/other-resources/experimental/amd-gpu-support/index.md
@@ -0,0 +1,264 @@
+---
+categories: ["Other Resources", "Experimental"]
+tags: []
+weight: 2
+title: "AMD GPU Support"
+linkTitle: "AMD GPU Support"
+---
+
+## Introduction
+
+Welcome to the specialized guide designed to assist Akash Providers in enabling AMD GPU support within their Kubernetes clusters. This documentation is particularly crafted for system administrators, developers, and DevOps professionals who manage and operate their Akash Providers. The focus here is to guide you through the process of integrating AMD GPUs into your Kubernetes/Akash setup, ensuring that they can be utilized in the Akash Network.
+
+Throughout this guide, you will find step-by-step instructions on installing the necessary AMD drivers, configuring Kubernetes to acknowledge and leverage AMD GPUs.
+
+This documentation is vital for Akash Providers and Clients who aim to deploy advanced workloads such as machine learning models, high-performance computing tasks, or any applications that benefit from GPU acceleration. By following this guide, you will be able to enhance your service offerings on the Akash Network, catering to a wider range of computational needs with AMD GPU support.
+
+> **NOTE**: To effectively enable AMD GPU support, ensure that your `akash-provider` and `provider-services` (CLI) are updated to version `0.4.9-rc0` or higher. This is a prerequisite for proper integration and functionality of AMD GPUs on your Akash Provider.
+
+## Limitations
+
+Current constraints dictate that combining NVIDIA and AMD GPU vendors within the same Kubernetes worker node is not allowed. However, it is permissible to have different GPU vendors within the same Kubernetes cluster, as long as each individual worker node exclusively uses GPUs from a single vendor, be it NVIDIA or AMD.
+
+- **Vendor Constraint:** Combining NVIDIA and AMD GPUs within the same Kubernetes worker node is not permitted.
+- **Vendor Homogeneity:** It is permissible to mix different GPU vendors on the same Kubernetes cluster. However, this mixing is not allowed within a single worker node.
+- **Vendor Exclusivity:** Each worker node must exclusively use GPUs from a single vendor, either NVIDIA or AMD. This means a single node cannot have a mix of NVIDIA and AMD GPUs, but different nodes within the same cluster can use different vendors.
+
+## Installing the AMD GPU Driver
+
+Follow these steps to install the AMD GPU Driver:
+
+1. Install AMD GPU drivers using DKMS
+
+- Apply these commands on your node with AMD GPU:
+
+  > based on https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/ubuntu.html
+
+  ```
+  mkdir --parents --mode=0755 /etc/apt/keyrings
+
+  wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
+  gpg --dearmor | tee /etc/apt/keyrings/rocm.gpg > /dev/null
+
+  echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/6.0.1/ubuntu jammy main" \
+  | tee /etc/apt/sources.list.d/amdgpu.list
+  apt update
+  apt -y install amdgpu-dkms
+  ```
+
+2. Make sure the right driver is loaded:
+
+- Reboot the node:
+  By default `/lib/modules//kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko` is loaded, however you cannot simply `modprobe -r amdgpu` and then `modprobe amdgpu`.
+  You need to reboot to make sure the correct AMD GPU driver (DKMS `/lib/modules//updates/dkms/amdgpu.ko`) is properly loaded.
+
+- Verify correct version is loaded (you may see a higher version, that's okay):
+
+  ```
+  # dmesg -T |grep 'amdgpu version'
+  [Fri Jan 26 22:47:18 2024] [drm] amdgpu version: 6.3.6
+
+  # dmesg -T |grep -i 'Initialized amdgpu'
+  [Fri Jan 26 22:47:19 2024] [drm] Initialized amdgpu 3.56.0 20150101 for 0000:1b:00.0 on minor 1
+  ```
+
+## Enabling AMD GPU Support in Akash Provider
+
+### 1. Install `ROCm/k8s-device-plugin` helm-chart
+
+- Add the helm repository and install the chart:
+
+  ```bash
+  helm repo add amd-gpu-helm https://rocm.github.io/k8s-device-plugin/
+  helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin amd-gpu amd-gpu-helm/amd-gpu --version 0.12.0
+  ```
+
+- Verify the installation:
+
+  > NOTE: replace `node1` with the node name of your worker node (`kubectl get nodes`)
+
+  ```bash
+  kubectl -n amd-device-plugin logs ds/amd-gpu-device-plugin-daemonset
+  kubectl describe node node1 | grep -B1 -Ei 'nvidia.com/gpu|amd.com/gpu'
+  ```
+
+- Example output:
+
+  ```bash
+  # kubectl -n amd-device-plugin logs ds/amd-gpu-device-plugin-daemonset
+  I0126 22:47:52.227295       1 main.go:305] AMD GPU device plugin for Kubernetes
+  I0126 22:47:52.227493       1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
+  I0126 22:47:52.227506       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
+  I0126 22:47:52.227524       1 manager.go:42] Starting device plugin manager
+  I0126 22:47:52.227543       1 manager.go:46] Registering for system signal notifications
+  I0126 22:47:52.228216       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
+  I0126 22:47:52.228421       1 manager.go:60] Starting Discovery on new plugins
+  I0126 22:47:52.228446       1 manager.go:66] Handling incoming signals
+  I0126 22:47:52.228491       1 manager.go:71] Received new list of plugins: [gpu]
+  I0126 22:47:52.228555       1 manager.go:110] Adding a new plugin "gpu"
+  I0126 22:47:52.228594       1 plugin.go:64] gpu: Starting plugin server
+  I0126 22:47:52.228605       1 plugin.go:94] gpu: Starting the DPI gRPC server
+  I0126 22:47:52.229986       1 plugin.go:112] gpu: Serving requests...
+  I0126 22:48:02.237090       1 plugin.go:128] gpu: Registering the DPI with Kubelet
+  I0126 22:48:02.238870       1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
+  I0126 22:48:02.246025       1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:1b:00.0
+  I0126 22:48:02.323568       1 main.go:149] Watching GPU with bus ID: 0000:1b:00.0 NUMA Node: [0]
+
+  # kubectl describe node node1 | grep -B1 -Ei 'nvidia.com/gpu|amd.com/gpu'
+  Capacity:
+    amd.com/gpu:  1
+  --
+  Allocatable:
+    amd.com/gpu:  1
+  --
+    hugepages-2Mi  0 (0%)  0 (0%)
+    amd.com/gpu    0       0
+  ```
+
+### 2. Label AMD GPU Node
+
+- Label your AMD GPU node (replace `mi210` with your AMD GPU model):
+  ```bash
+  kubectl label node node1 akash.network/capabilities.gpu.vendor.amd.model.mi210=true
+  ```
+
+### 3. Test AMD GPU with TensorFlow in Pod
+
+Before proceeding with the deployment, be aware of the following:
+
+> **NOTE:** Starting the `alexnet-gpu` may take a considerable amount of time, especially over slow network connections. This delay is due to the large size of the image, approximately `10 GiB`, as detailed on [Docker Hub](https://hub.docker.com/r/rocm/tensorflow/tags).
+
+To deploy and test the TensorFlow environment on AMD GPUs, follow these steps:
+
+1. Create the pod using the provided YAML file:
+
+   ```bash
+   kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/c9fc007f07fca4ea1c495ab57f54e10ffa9e2a6b/example/pod/alexnet-gpu.yaml
+   ```
+
+2. Check the logs to verify successful deployment and operation:
+
+   ```bash
+   kubectl logs alexnet-tf-gpu-pod
+   # Expected output includes:
+   # TensorFlow version information, list of devices (e.g., '/gpu:0'), and performance metrics (e.g., total images/sec).
+   ```
+
+   Example output:
+
+   ```
+   $ kubectl logs alexnet-tf-gpu-pod
+   ...
+   2024-01-26 22:50:28.771404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 63950 MB memory: -> device: 0, name: AMD Instinct MI210, pci bus id: 0000:1b:00.0
+   ...
+   TensorFlow: 2.14
+   ...
+   Devices: ['/gpu:0']
+   ...
+   total images/sec: 5849.31
+   ```
+
+3. Once testing is complete, delete the pod:
+   ```bash
+   kubectl delete pod alexnet-tf-gpu-pod
+   ```
+
+## Update bid pricing parameters for your AMD GPU card
+
+Make sure you are using the latest bid pricing script. You can follow [these](/docs/providers/provider-faq-and-guide#provider-bid-script-migration---gpu-models) instructions.
+
+- Add the pricing for your AMD GPU model (replace `mi210` with your model) to `provider.yaml` file:
+  ```yaml
+  price_target_gpu_mappings: "mi210=190,*=200"
+  ```
+
+This sets `$190`/month for AMD GPU MI 210 card and defaults to `$200`/month when the GPU model was [not](https://github.com/akash-network/support/issues/166) explicitly set.
+
+## Testing AMD GPU with TensorFlow in Akash Deployment
+
+To test TensorFlow with AMD GPU in Akash Deployment:
+
+- Base your deployment on the image & command/args from the provided YAML file.
+- Use image: `rocm/tensorflow`.
+- Override the `command` & `args` in the SDL.
+- Execute the benchmarking command:
+  ```bash
+  python3 /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=alexnet
+  ```
+
+### Example SDL
+
+Use the following SDL configuration to deploy `rocm/tensorflow` image:
+
+> Make sure to replace `SSH_PUBKEY` with your public SSH key should you want to be able ssh to your deployment instead of `lease-shell` into it.
+
+```yaml
+---
+version: "2.0"
+
+services:
+  app:
+    image: rocm/tensorflow:latest
+    env:
+      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII andy'
+    command:
+      - "sh"
+      - "-c"
+    args:
+      - 'apt-get update;
+        apt-get install -y --no-install-recommends -- ssh speedtest-cli netcat-openbsd curl wget ca-certificates jq less iproute2 iputils-ping vim bind9-dnsutils nginx;
+        mkdir -p -m0755 /run/sshd;
+        mkdir -m700 ~/.ssh;
+        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
+        chmod 0600 ~/.ssh/authorized_keys;
+        ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
+        md5sum ~/.ssh/authorized_keys;
+        exec /usr/sbin/sshd -D'
+    expose:
+      - port: 80
+        as: 80
+        to:
+          - global: true
+      - port: 22
+        as: 22
+        to:
+          - global: true
+
+profiles:
+  compute:
+    app:
+      resources:
+        cpu:
+          units: 2
+        memory:
+          size: 8Gi
+        storage:
+          size: 25Gi
+        gpu:
+          units: 1
+          attributes:
+            vendor:
+              amd:
+                - model: mi210
+  placement:
+    akash:
+      attributes:
+        host: akash
+      pricing:
+        app:
+          denom: uakt
+          amount: 1000000
+
+deployment:
+  app:
+    akash:
+      profile: app
+      count: 1
+```
+
+## Additional material
+
+- Exploring Integration of `rocm-smi` in AMD GPU Pods for Enhanced Compatibility
+
+We are [exploring](https://github.com/ROCm/k8s-device-plugin/issues/44) the possibility of including the `rocm-smi` tool by default in AMD GPU Pods, analogous to how `nvidia-smi` is available in NVIDIA GPU Pods. This inclusion in NVIDIA pods is facilitated by the NVIDIA device plugin, which mounts necessary host paths and utilizes environment variables like `NVIDIA_DRIVER_CAPABILITIES`. For more detailed examples and information, refer to the NVIDIA Container Toolkit documentation [here](https://github.com/NVIDIA/nvidia-container-toolkit/blob/a2262d00cc6d98ac2e95ae2f439e699a7d64dc17/cmd/nvidia-container-runtime/README.md?plain=1#L147-L161). Our goal is to achieve similar functionality for AMD GPUs, enhancing user experience.
diff --git a/src/content/Docs/providers/provider-faq-and-guide/index.md b/src/content/Docs/providers/provider-faq-and-guide/index.md
index b40f8218..f4597ffd 100644
--- a/src/content/Docs/providers/provider-faq-and-guide/index.md
+++ b/src/content/Docs/providers/provider-faq-and-guide/index.md
@@ -23,7 +23,7 @@ The guide is broken down into the following categories:
 - [Force New ReplicaSet Workaround](#force-new-replicaset-workaround)
 - [Kill Zombie Processes](#kill-zombie-processes)
 - [Close Leases Based on Image](#close-leases-based-on-image)
-- [Provider Bid Script Migration for GPU Model Pricing](#provider-bid-script-migration-gpu-models)
+- [Provider Bid Script Migration for GPU Model Pricing](#provider-bid-script-migration---gpu-models)
 - [GPU Provider Troubleshooting](#gpu-provider-troubleshooting)
 
 ## Provider Maintenance
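Most of this series rewrites absolute akash.network/docs URLs into relative paths. A hedged helper for checking that no absolute links remain after the patches are applied (the path assumes the src/content/Docs layout shown in the diffs above; adjust as needed):

```bash
# List markdown files that still reference the absolute docs URL.
grep -RIn --include='*.md' 'https://akash.network/docs' src/content/Docs | sort
```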