Merge remote-tracking branch 'upstream/master' into dag-execute
andylizf committed Oct 26, 2024
2 parents d107a73 + 0e915d3 commit 90405ea
Showing 50 changed files with 1,238 additions and 690 deletions.
88 changes: 49 additions & 39 deletions docs/source/examples/managed-jobs.rst
Managed Jobs
============

.. tip::

This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.

SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any underlying spot preemptions or hardware failures.
Managed jobs can be used in three modes:

#. :ref:`Managed spot jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
#. :ref:`Managed on-demand/reserved jobs <on-demand>`: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources.
#. :ref:`Managed pipelines <pipeline>`: Run pipelines that contain multiple tasks (which
can have different resource requirements and ``setup``/``run`` commands).
Useful for running a sequence of tasks that depend on each other, e.g., data
processing, training a model, and then running inference on it.


.. _spot-jobs:

Managed Spot Jobs
-----------------

In this mode, jobs run on spot instances, and preemptions are auto-recovered by SkyPilot.

To launch a managed spot job, use :code:`sky jobs launch --use-spot`.
SkyPilot automatically finds available spot instances across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.


Here is an example of a BERT training job failing over different regions across AWS and GCP.

To use managed spot jobs, there are two requirements:
#. :ref:`Checkpointing <checkpointing>` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket <sky-storage>`. The program can reload the latest checkpoint when restarted.
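As a sketch of that checkpointing setup (the mount path and bucket name below are hypothetical placeholders; the exact fields are described in the :ref:`sky-storage` docs), the job YAML can mount a cloud bucket that the application writes checkpoints to:

```yaml
# Hedged sketch: mount a cloud bucket where the training code
# periodically writes checkpoints, so a recovered job can resume.
file_mounts:
  /checkpoint:             # hypothetical mount path used by the app
    name: my-ckpt-bucket   # hypothetical, globally unique bucket name
    mode: MOUNT            # writes are synced to the bucket

run: |
  # The application reloads the latest checkpoint on (re)start, e.g.:
  python train.py --resume-from /checkpoint
```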


Quick comparison between *managed spot jobs* vs. *launching spot clusters*:

.. list-table::
   :widths: 30 18 12 35
   :header-rows: 1

   * - Command
     - Managed?
     - SSH-able?
     - Best for
   * - :code:`sky jobs launch --use-spot`
     - Yes, preemptions are auto-recovered
     - No
     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
   * - :code:`sky launch --use-spot`
     - No, preemptions are not handled
     - Yes
     - Interactive dev on spot instances (especially for hardware with low preemption rates)

.. _job-yaml:

Job YAML
We can launch it with the following:
setup: |
# Fill in your wandb key: copy from https://wandb.ai/authorize
# Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
# to pass the key in the command line, during `sky jobs launch`.
echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
pip install -e .
Real-World Examples
-------------------

.. _on-demand:

Managed On-Demand/Reserved Jobs
-------------------------------

The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
on-demand or reserved instances. This is useful to have SkyPilot monitor any underlying
machine failures and transparently recover the job.

To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI.
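For instance (a minimal sketch; the accelerator choice is an arbitrary assumption):

```yaml
resources:
  accelerators: A100:1   # hypothetical resource request
  use_spot: false        # run on on-demand/reserved instances
```

Such a YAML is then launched with ``sky jobs launch``, or overridden per-invocation with ``--use-spot false``.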
(``sky jobs launch`` is a managed-job interface, while ``sky launch`` is a cluster interface that you can launch tasks on, albeit not managed.)

Either Spot or On-Demand/Reserved
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use ``any_of`` to specify either spot or on-demand/reserved instances as
candidate resources for a job. See documentation :ref:`here
<multiple-resources>` for more details.

- use_spot: false
In this example, SkyPilot will perform cost optimizations to select the resource to use, which almost certainly
will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand/reserved instances.
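For reference, the elided ``any_of`` snippet above can be sketched in full as (a hedged reconstruction from the surrounding description):

```yaml
resources:
  any_of:
    - use_spot: true    # preferred when available (cheaper)
    - use_spot: false   # fallback to on-demand/reserved
```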

More advanced policies for resource selection, such as the `Can't Be Late
<https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao>`__ (NSDI'24)
paper, may be supported in the future.

Running Many Parallel Jobs
--------------------------

For batch jobs such as **data processing** or **hyperparameter sweeps**, you can launch many jobs in parallel. See :ref:`many-jobs`.

Useful CLIs
-----------

Cancel a managed job:
If any failure happens for a managed job, check :code:`sky jobs queue -a` for a brief description of the failure. For more details, check :code:`sky jobs logs --controller <job_id>`.


.. _pipeline:

Managed Pipelines
-----------------

A pipeline is a managed job that contains a sequence of tasks running one after another.
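A pipeline YAML can be sketched as follows (task names, resources, and commands are illustrative assumptions; tasks are separated by ``---`` and run in order):

```yaml
name: pipeline          # hypothetical pipeline name

---

name: train             # first task
resources:
  accelerators: A100:1  # hypothetical
run: |
  python train.py

---

name: eval              # runs after `train` finishes
resources:
  accelerators: T4:1    # hypothetical
run: |
  python eval.py
```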

To submit the pipeline, the same command :code:`sky jobs launch` is used.



Job Dashboard
-------------

Use ``sky jobs dashboard`` to open a dashboard to see all jobs:

2 changes: 1 addition & 1 deletion docs/source/reference/faq.rst
How to ensure my workdir's ``.git`` is synced up for managed spot jobs?
Currently, there is a difference in whether ``.git`` is synced up depending on the command used:

- For regular ``sky launch``, the workdir's ``.git`` is synced up by default.
- For managed jobs ``sky jobs launch``, the workdir's ``.git`` is excluded by default.

In the second case, to ensure the workdir's ``.git`` is synced up for managed spot jobs, you can explicitly add a file mount to sync it up:

20 changes: 15 additions & 5 deletions docs/source/reference/kubernetes/kubernetes-deployment.rst
Deploying on Google Cloud GKE
.. code-block:: console
$ sky show-gpus --cloud kubernetes
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
L4 1, 2, 4 8 6
A100 1, 2 4 2
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
my-cluster-0 L4 4 4
my-cluster-1 L4 4 2
my-cluster-2 A100 2 2
my-cluster-3 A100 2 0
.. note::
GKE autopilot clusters are currently not supported. Only GKE standard clusters are supported.
Deploying on Amazon EKS
.. code-block:: console
$ sky show-gpus --cloud kubernetes
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
A100 1, 2 4 2
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
my-cluster-0 A100 2 2
.. _kubernetes-setup-onprem:

13 changes: 9 additions & 4 deletions docs/source/reference/kubernetes/kubernetes-getting-started.rst
You can also inspect the real-time GPU usage on the cluster with :code:`sky show-gpus`:
$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
L4 1, 2, 4 12 12
H100 1, 2, 4, 8 16 16
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS

Using Custom Images
-------------------
By default, we maintain and use two SkyPilot container images for use on Kubernetes clusters:

1. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot``: used for CPU-only clusters (`Dockerfile <https://github.com/skypilot-org/skypilot/blob/master/Dockerfile_k8s>`__).
2. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu``: used for GPU clusters (`Dockerfile <https://github.com/skypilot-org/skypilot/blob/master/Dockerfile_k8s_gpu>`__).

These images are pre-installed with SkyPilot dependencies for fast startup.

To use your own image, add :code:`image_id: docker:<your image tag>` to the :code:`resources` section of your task YAML.
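For example (the image tag below is a hypothetical placeholder):

```yaml
resources:
  image_id: docker:myrepo/myimage:latest   # hypothetical image tag
```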

6 changes: 3 additions & 3 deletions docs/source/reference/kubernetes/kubernetes-setup.rst
You can also check the GPUs available on your nodes by running:
$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
L4 1, 2, 4 12 12
H100 1, 2, 4, 8 16 16
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
6 changes: 3 additions & 3 deletions docs/source/reservations/existing-machines.rst
Deploying SkyPilot
$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
L4 1, 2, 4 12 12
H100 1, 2, 4, 8 16 16
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
4 changes: 2 additions & 2 deletions examples/k8s_cloud_deploy/README.md
NAME STATUS ROLES AGE VERSION

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
A10 1 2 2

Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
2 changes: 1 addition & 1 deletion examples/managed_job_with_storage.yaml
# Runs a task that uses cloud buckets for uploading and accessing files.
#
# Usage:
# sky jobs launch -c spot-storage examples/managed_job_with_storage.yaml
# sky down spot-storage

resources:
2 changes: 1 addition & 1 deletion llm/axolotl/axolotl-spot.yaml
# HF_TOKEN=abc BUCKET=<unique-name> sky launch -c axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET -i30 --down
#
# Managed spot (auto-recovery; for full runs):
# HF_TOKEN=abc BUCKET=<unique-name> sky jobs launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET

name: axolotl

2 changes: 1 addition & 1 deletion llm/axolotl/readme.md
ssh -L 8888:localhost:8888 axolotl-spot

Launch managed spot instances (auto-recovery; for full runs):
```
HF_TOKEN=abc BUCKET=<unique-name> sky jobs launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET
```
12 changes: 6 additions & 6 deletions llm/falcon/README.md
# Finetuning Falcon with SkyPilot

This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, an open-source LLM that rivals many current closed-source models, including ChatGPT.

* [Blog post](https://huggingface.co/blog/falcon)
* [Repo](https://huggingface.co/tiiuae/falcon-40b)
See the Falcon SkyPilot YAML for [training](train.yaml). Serving is currently a work in progress and a YAML will be provided for that soon! We are also working on adding an evaluation step to evaluate the model you finetuned compared to the base model.

## Running Falcon on SkyPilot
Finetuning `Falcon-7B` and `Falcon-40B` require GPUs with 80GB memory,
but `Falcon-7b-sharded` requires only 40GB memory. Thus,
* If your GPU has 40 GB memory or less (e.g., Nvidia A100): use `ybelkada/falcon-7b-sharded-bf16`.
* If your GPU has 80 GB memory (e.g., Nvidia A100-80GB): you can also use `tiiuae/falcon-7b` and `tiiuae/falcon-40b`.

Try `sky show-gpus --all` for supported GPUs.

Steps for training on your cloud(s):
1. In [train.yaml](train.yaml), set the following variables in `envs`:

- Replace the `OUTPUT_BUCKET_NAME` with a unique name. SkyPilot will create this bucket for you to store the model weights.
- Replace the `WANDB_API_KEY` with your own key.
- Replace the `MODEL_NAME` with your desired base model.

2. **Training the Falcon model using spot instances**:

```bash
sky jobs launch --use-spot -n falcon falcon.yaml
```

Currently, such `A100-80GB:1` spot instances are only available on AWS and GCP.
6 changes: 3 additions & 3 deletions llm/vicuna-llama-2/README.md

### Reducing costs by 3x with spot instances

[SkyPilot Managed Jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html) is a library built on top of SkyPilot that helps users run jobs on spot instances without worrying about interruptions. That is the tool used by the LMSYS organization to train the first version of Vicuna (more details can be found in their [launch blog post](https://lmsys.org/blog/2023-03-30-vicuna/) and [example](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna)). With this, the training cost can be reduced from $1000 to **\$300**.

To use SkyPilot Managed Spot Jobs, you can simply replace `sky launch` with `sky jobs launch` in the above command:

```bash
sky jobs launch -n vicuna train.yaml \
--env ARTIFACT_BUCKET_NAME=<your-bucket-name> \
--env WANDB_API_KEY=<your-wandb-api-key>
```
4 changes: 2 additions & 2 deletions llm/vicuna/README.md
Steps for training on your cloud(s):
2. **Training the Vicuna-7B model on 8 A100 GPUs (80GB memory) using spot instances**:
```bash
# Launch it on managed spot to save 3x cost
sky jobs launch -n vicuna train.yaml
```
Note: if you would like to see the training curve on W&B, you can add `--env WANDB_API_KEY` to the above command, which will propagate your local W&B API key in the environment variable to the job.

[Optional] Train a larger 13B model
```
# Train a 13B model instead of the default 7B
sky jobs launch -n vicuna-7b train.yaml --env MODEL_SIZE=13
# Use *unmanaged* spot instances (i.e., preemptions won't get auto-recovered).
# Unmanaged spot provides a better interactive development experience but is vulnerable to spot preemptions.
2 changes: 2 additions & 0 deletions sky/__init__.py
Lambda = clouds.Lambda
SCP = clouds.SCP
Kubernetes = clouds.Kubernetes
K8s = Kubernetes
OCI = clouds.OCI
Paperspace = clouds.Paperspace
RunPod = clouds.RunPod
'GCP',
'IBM',
'Kubernetes',
'K8s',
'Lambda',
'OCI',
'Paperspace',