Bump ubuntu 22.04 + fix CI mlflow tests (#3716)
Co-authored-by: v-chen_data <[email protected]>
KuuCi and v-chen_data authored Nov 19, 2024
1 parent 00c927e commit 890d9ee
Showing 7 changed files with 65 additions and 49 deletions.
9 changes: 6 additions & 3 deletions docker/Dockerfile
@@ -11,8 +11,8 @@
 ARG CUDA_VERSION=11.3.1

 # Calculate the base image based on CUDA_VERSION
-ARG BASE_IMAGE=${CUDA_VERSION:+"nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04"}
-ARG BASE_IMAGE=${BASE_IMAGE:-"ubuntu:20.04"}
+ARG BASE_IMAGE=${CUDA_VERSION:+"nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu22.04"}
+ARG BASE_IMAGE=${BASE_IMAGE:-"ubuntu:22.04"}

 # The Python version to install
 ARG PYTHON_VERSION=3.10
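The two `ARG BASE_IMAGE` lines in this hunk rely on Dockerfile parameter expansion to pick a default base image. A minimal Python sketch of that resolution follows. It assumes an empty `CUDA_VERSION` is treated like an unset one and that no `BASE_IMAGE` build argument overrides the default; `expand_plus`, `expand_minus`, and `default_base_image` are hypothetical names used only for illustration.

```python
def expand_plus(value: str, word: str) -> str:
    """Model ${VAR:+word}: yield word when VAR is non-empty, else an empty string."""
    return word if value else ""


def expand_minus(value: str, word: str) -> str:
    """Model ${VAR:-word}: yield VAR when it is non-empty, else word."""
    return value if value else word


def default_base_image(cuda_version: str) -> str:
    """Mirror the two ARG lines (assumption: no --build-arg BASE_IMAGE is given)."""
    # First ARG: a CUDA image only when a CUDA version was requested.
    base_image = expand_plus(
        cuda_version,
        f"nvidia/cuda:{cuda_version}-cudnn8-devel-ubuntu22.04",
    )
    # Second ARG: fall back to plain Ubuntu when the first expansion was empty.
    return expand_minus(base_image, "ubuntu:22.04")


print(default_base_image("11.3.1"))  # nvidia/cuda:11.3.1-cudnn8-devel-ubuntu22.04
print(default_base_image(""))        # ubuntu:22.04
```

In practice the build matrix passes `BASE_IMAGE` explicitly, so these defaults probably matter mostly for ad hoc builds.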
@@ -251,7 +251,7 @@ ARG MOFED_VERSION

 RUN if [ -n "$MOFED_VERSION" ] ; then \
     wget -qO - http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | sudo apt-key add - && \
-    wget -P /etc/apt/sources.list.d/ http://linux.mellanox.com/public/repo/mlnx_ofed/$MOFED_VERSION/ubuntu20.04/mellanox_mlnx_ofed.list && \
+    wget -P /etc/apt/sources.list.d/ http://linux.mellanox.com/public/repo/mlnx_ofed/$MOFED_VERSION/ubuntu22.04/mellanox_mlnx_ofed.list && \
     apt-get update && \
     apt-get install -y mlnx-ofed-dpdk-upstream-libs-user-only ; \
     fi
@@ -325,6 +325,9 @@ RUN pip install --no-cache-dir --upgrade \
     urllib3${URLLIB3_VERSION} \
     python-snappy

+RUN apt-get remove -y python3-blinker
+RUN pip install blinker
+
 ##################################################
 # Override NVIDIA mistaken env var for 11.8 images
 ##################################################
18 changes: 9 additions & 9 deletions docker/README.md
@@ -30,15 +30,15 @@ To install composer, once inside the image, run `pip install mosaicml`.
 <!-- BEGIN_PYTORCH_BUILD_MATRIX -->
 | Linux Distro | Flavor | PyTorch Version | CUDA Version | Python Version | Docker Tags |
 |----------------|----------|-------------------|---------------------|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Ubuntu 20.04 | Base | 2.5.1 | 12.4.1 (Infiniband) | 3.11 | `mosaicml/pytorch:latest`, `mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu20.04` |
-| Ubuntu 20.04 | Base | 2.5.1 | 12.4.1 (EFA) | 3.11 | `mosaicml/pytorch:latest-aws`, `mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu20.04-aws` |
-| Ubuntu 20.04 | Base | 2.5.1 | cpu | 3.11 | `mosaicml/pytorch:latest_cpu`, `mosaicml/pytorch:2.5.1_cpu-python3.11-ubuntu20.04` |
-| Ubuntu 20.04 | Base | 2.4.1 | 12.4.1 (Infiniband) | 3.11 | `mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu20.04` |
-| Ubuntu 20.04 | Base | 2.4.1 | 12.4.1 (EFA) | 3.11 | `mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu20.04-aws` |
-| Ubuntu 20.04 | Base | 2.4.1 | cpu | 3.11 | `mosaicml/pytorch:2.4.1_cpu-python3.11-ubuntu20.04` |
-| Ubuntu 20.04 | Base | 2.3.1 | 12.1.1 (Infiniband) | 3.11 | `mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04` |
-| Ubuntu 20.04 | Base | 2.3.1 | 12.1.1 (EFA) | 3.11 | `mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04-aws` |
-| Ubuntu 20.04 | Base | 2.3.1 | cpu | 3.11 | `mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu20.04` |
+| Ubuntu 22.04 | Base | 2.5.1 | 12.4.1 (Infiniband) | 3.11 | `mosaicml/pytorch:latest`, `mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu22.04` |
+| Ubuntu 22.04 | Base | 2.5.1 | 12.4.1 (EFA) | 3.11 | `mosaicml/pytorch:latest-aws`, `mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu22.04-aws` |
+| Ubuntu 22.04 | Base | 2.5.1 | cpu | 3.11 | `mosaicml/pytorch:latest_cpu`, `mosaicml/pytorch:2.5.1_cpu-python3.11-ubuntu22.04` |
+| Ubuntu 22.04 | Base | 2.4.1 | 12.4.1 (Infiniband) | 3.11 | `mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu22.04` |
+| Ubuntu 22.04 | Base | 2.4.1 | 12.4.1 (EFA) | 3.11 | `mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu22.04-aws` |
+| Ubuntu 22.04 | Base | 2.4.1 | cpu | 3.11 | `mosaicml/pytorch:2.4.1_cpu-python3.11-ubuntu22.04` |
+| Ubuntu 22.04 | Base | 2.3.1 | 12.1.1 (Infiniband) | 3.11 | `mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu22.04` |
+| Ubuntu 22.04 | Base | 2.3.1 | 12.1.1 (EFA) | 3.11 | `mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu22.04-aws` |
+| Ubuntu 22.04 | Base | 2.3.1 | cpu | 3.11 | `mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu22.04` |
 <!-- END_PYTORCH_BUILD_MATRIX -->

 **Note**: The `mosaicml/pytorch:latest`, `mosaicml/pytorch:latest_cpu`, and `mosaicml/pytorch:latest-aws`
58 changes: 29 additions & 29 deletions docker/build_matrix.yaml
@@ -1,6 +1,6 @@
 # This file is automatically generated by generate_build_matrix.py. DO NOT EDIT!
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
   IMAGE_NAME: torch-2-5-1-cu124
   MOFED_VERSION: latest-23.10
@@ -10,14 +10,14 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.5.1
   TAGS:
-  - mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu20.04
-  - ghcr.io/databricks-mosaic/pytorch:2.5.1_cu124-python3.11-ubuntu20.04
+  - mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu22.04
+  - ghcr.io/databricks-mosaic/pytorch:2.5.1_cu124-python3.11-ubuntu22.04
   - mosaicml/pytorch:latest
   - ghcr.io/databricks-mosaic/pytorch:latest
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.20.1
 - AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
   IMAGE_NAME: torch-2-5-1-cu124-aws
   MOFED_VERSION: ''
@@ -27,14 +27,14 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.5.1
   TAGS:
-  - mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu20.04-aws
-  - ghcr.io/databricks-mosaic/pytorch:2.5.1_cu124-python3.11-ubuntu20.04-aws
+  - mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu22.04-aws
+  - ghcr.io/databricks-mosaic/pytorch:2.5.1_cu124-python3.11-ubuntu22.04-aws
   - mosaicml/pytorch:latest-aws
   - ghcr.io/databricks-mosaic/pytorch:latest-aws
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.20.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:20.04
+  BASE_IMAGE: ubuntu:22.04
   CUDA_VERSION: ''
   IMAGE_NAME: torch-2-5-1-cpu
   MOFED_VERSION: ''
@@ -44,14 +44,14 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.5.1
   TAGS:
-  - mosaicml/pytorch:2.5.1_cpu-python3.11-ubuntu20.04
-  - ghcr.io/databricks-mosaic/pytorch:2.5.1_cpu-python3.11-ubuntu20.04
+  - mosaicml/pytorch:2.5.1_cpu-python3.11-ubuntu22.04
+  - ghcr.io/databricks-mosaic/pytorch:2.5.1_cpu-python3.11-ubuntu22.04
   - mosaicml/pytorch:latest_cpu
   - ghcr.io/databricks-mosaic/pytorch:latest_cpu
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.20.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
   IMAGE_NAME: torch-2-4-1-cu124
   MOFED_VERSION: latest-23.10
@@ -61,12 +61,12 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.4.1
   TAGS:
-  - mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu20.04
-  - ghcr.io/databricks-mosaic/pytorch:2.4.1_cu124-python3.11-ubuntu20.04
+  - mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu22.04
+  - ghcr.io/databricks-mosaic/pytorch:2.4.1_cu124-python3.11-ubuntu22.04
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.19.1
 - AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   CUDA_VERSION: 12.4.1
   IMAGE_NAME: torch-2-4-1-cu124-aws
   MOFED_VERSION: ''
@@ -76,12 +76,12 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.4.1
   TAGS:
-  - mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu20.04-aws
-  - ghcr.io/databricks-mosaic/pytorch:2.4.1_cu124-python3.11-ubuntu20.04-aws
+  - mosaicml/pytorch:2.4.1_cu124-python3.11-ubuntu22.04-aws
+  - ghcr.io/databricks-mosaic/pytorch:2.4.1_cu124-python3.11-ubuntu22.04-aws
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.19.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:20.04
+  BASE_IMAGE: ubuntu:22.04
   CUDA_VERSION: ''
   IMAGE_NAME: torch-2-4-1-cpu
   MOFED_VERSION: ''
@@ -91,12 +91,12 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.4.1
   TAGS:
-  - mosaicml/pytorch:2.4.1_cpu-python3.11-ubuntu20.04
-  - ghcr.io/databricks-mosaic/pytorch:2.4.1_cpu-python3.11-ubuntu20.04
+  - mosaicml/pytorch:2.4.1_cpu-python3.11-ubuntu22.04
+  - ghcr.io/databricks-mosaic/pytorch:2.4.1_cpu-python3.11-ubuntu22.04
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.19.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
   CUDA_VERSION: 12.1.1
   IMAGE_NAME: torch-2-3-1-cu121
   MOFED_VERSION: latest-23.10
@@ -119,12 +119,12 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.3.1
   TAGS:
-  - mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04
-  - ghcr.io/databricks-mosaic/pytorch:2.3.1_cu121-python3.11-ubuntu20.04
+  - mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu22.04
+  - ghcr.io/databricks-mosaic/pytorch:2.3.1_cu121-python3.11-ubuntu22.04
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.18.1
 - AWS_OFI_NCCL_VERSION: v1.11.0-aws
-  BASE_IMAGE: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
   CUDA_VERSION: 12.1.1
   IMAGE_NAME: torch-2-3-1-cu121-aws
   MOFED_VERSION: ''
@@ -147,12 +147,12 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.3.1
   TAGS:
-  - mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu20.04-aws
-  - ghcr.io/databricks-mosaic/pytorch:2.3.1_cu121-python3.11-ubuntu20.04-aws
+  - mosaicml/pytorch:2.3.1_cu121-python3.11-ubuntu22.04-aws
+  - ghcr.io/databricks-mosaic/pytorch:2.3.1_cu121-python3.11-ubuntu22.04-aws
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.18.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:20.04
+  BASE_IMAGE: ubuntu:22.04
   CUDA_VERSION: ''
   IMAGE_NAME: torch-2-3-1-cpu
   MOFED_VERSION: ''
@@ -162,12 +162,12 @@
   PYTORCH_NIGHTLY_VERSION: ''
   PYTORCH_VERSION: 2.3.1
   TAGS:
-  - mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu20.04
-  - ghcr.io/databricks-mosaic/pytorch:2.3.1_cpu-python3.11-ubuntu20.04
+  - mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu22.04
+  - ghcr.io/databricks-mosaic/pytorch:2.3.1_cpu-python3.11-ubuntu22.04
   TARGET: pytorch_stage
   TORCHVISION_VERSION: 0.18.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
+  BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
   COMPOSER_INSTALL_COMMAND: mosaicml[all]==0.27.0
   CUDA_VERSION: 12.4.1
   IMAGE_NAME: composer-0-27-0
@@ -185,7 +185,7 @@
   TARGET: composer_stage
   TORCHVISION_VERSION: 0.20.1
 - AWS_OFI_NCCL_VERSION: ''
-  BASE_IMAGE: ubuntu:20.04
+  BASE_IMAGE: ubuntu:22.04
   COMPOSER_INSTALL_COMMAND: mosaicml[all]==0.27.0
   CUDA_VERSION: ''
   IMAGE_NAME: composer-0-27-0-cpu
12 changes: 6 additions & 6 deletions docker/generate_build_matrix.py
@@ -35,10 +35,10 @@ def _get_torchvision_version(pytorch_version: str):

 def _get_base_image(cuda_version: str):
     if not cuda_version:
-        return 'ubuntu:20.04'
+        return 'ubuntu:22.04'
     if cuda_version == '12.4.1':
-        return f'nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04'
-    return f'nvidia/cuda:{cuda_version}-cudnn8-devel-ubuntu20.04'
+        return f'nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04'
+    return f'nvidia/cuda:{cuda_version}-cudnn8-devel-ubuntu22.04'


 def _get_cuda_version(pytorch_version: str, use_cuda: bool):
@@ -112,8 +112,8 @@ def _get_pytorch_tags(python_version: str, pytorch_version: str, cuda_version: s
     tags = []
     cuda_version_tag = _get_cuda_version_tag(cuda_version)
     tags += [
-        f'{base_image_name}:{pytorch_version}_{cuda_version_tag}-python{python_version}-ubuntu20.04',
-        f'{ghcr_base_image_name}:{pytorch_version}_{cuda_version_tag}-python{python_version}-ubuntu20.04',
+        f'{base_image_name}:{pytorch_version}_{cuda_version_tag}-python{python_version}-ubuntu22.04',
+        f'{ghcr_base_image_name}:{pytorch_version}_{cuda_version_tag}-python{python_version}-ubuntu22.04',
     ]

     if python_version == PRODUCTION_PYTHON_VERSION and pytorch_version == PRODUCTION_PYTORCH_VERSION:
@@ -294,7 +294,7 @@ def _main():
             interconnect = 'EFA'
         cuda_version = f"{entry['CUDA_VERSION']} ({interconnect})" if entry['CUDA_VERSION'] else 'cpu'
         table.append([
-            'Ubuntu 20.04', # Linux distro
+            'Ubuntu 22.04', # Linux distro
             'Base', # Flavor
             entry['PYTORCH_VERSION'], # Pytorch version
             cuda_version, # Cuda version
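To make the tag format in `_get_pytorch_tags` concrete, here is a minimal standalone sketch that reproduces the versioned tags listed in `build_matrix.yaml`. The helper `cuda_version_tag` is a hypothetical stand-in for `_get_cuda_version_tag`, whose body is not part of this diff; its behavior is inferred from the generated tags (`cu124`, `cu121`, `cpu`). The `-aws` suffix and the `latest` aliases are deliberately left out.

```python
BASE_IMAGE_NAME = "mosaicml/pytorch"
GHCR_BASE_IMAGE_NAME = "ghcr.io/databricks-mosaic/pytorch"


def cuda_version_tag(cuda_version: str) -> str:
    """Hypothetical stand-in for _get_cuda_version_tag (inferred, not from the diff)."""
    if not cuda_version:
        return "cpu"
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def versioned_tags(python_version: str, pytorch_version: str, cuda_version: str) -> list[str]:
    """Build the two versioned tags (Docker Hub and GHCR) for one matrix entry."""
    suffix = (
        f"{pytorch_version}_{cuda_version_tag(cuda_version)}"
        f"-python{python_version}-ubuntu22.04"
    )
    return [f"{BASE_IMAGE_NAME}:{suffix}", f"{GHCR_BASE_IMAGE_NAME}:{suffix}"]


for tag in versioned_tags("3.11", "2.5.1", "12.4.1"):
    print(tag)
# mosaicml/pytorch:2.5.1_cu124-python3.11-ubuntu22.04
# ghcr.io/databricks-mosaic/pytorch:2.5.1_cu124-python3.11-ubuntu22.04
```

Because the Ubuntu version is embedded in every tag, bumping the base image changes every published tag name, which is why `README.md` and `build_matrix.yaml` are regenerated in the same commit.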
8 changes: 8 additions & 0 deletions tests/fixtures/autouse_fixtures.py
@@ -148,3 +148,11 @@ def remove_run_name_env_var():
         os.environ['COMPOSER_RUN_NAME'] = composer_run_name
     if run_name is not None:
         os.environ['RUN_NAME'] = run_name
+
+
+@pytest.fixture(autouse=True)
+def setup_mlflow_tracking(monkeypatch, tmp_path):
+    mlflow = pytest.importorskip('mlflow')
+    tracking_uri = str(tmp_path / 'mlruns')
+    monkeypatch.setenv(mlflow.environment_variables.MLFLOW_TRACKING_URI.name, tracking_uri)
+    os.makedirs(tracking_uri, exist_ok=True)
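The side effect of this autouse fixture can be reproduced without pytest or mlflow. The sketch below uses only the standard library and assumes that `mlflow.environment_variables.MLFLOW_TRACKING_URI.name` evaluates to the literal `'MLFLOW_TRACKING_URI'` (the same literal appears in `test_mlflow_object_store.py` in this commit). `simulate_fixture_effect` is a hypothetical name used for illustration.

```python
import os
import tempfile


def simulate_fixture_effect() -> list[str]:
    """Mimic setup_mlflow_tracking and return the resulting tmp dir listing."""
    env_name = "MLFLOW_TRACKING_URI"  # assumed value of the mlflow constant's .name
    previous = os.environ.get(env_name)
    with tempfile.TemporaryDirectory() as tmp_path:
        tracking_uri = os.path.join(tmp_path, "mlruns")
        os.environ[env_name] = tracking_uri
        os.makedirs(tracking_uri, exist_ok=True)
        try:
            return os.listdir(tmp_path)
        finally:
            # Restore the environment, as monkeypatch would at teardown.
            if previous is None:
                del os.environ[env_name]
            else:
                os.environ[env_name] = previous


print(simulate_fixture_effect())  # ['mlruns']
```

Since the fixture creates `mlruns` inside each test's `tmp_path`, a test that lists that directory no longer sees it empty, which is consistent with the updated assertions in `tests/utils/test_dist.py`.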
5 changes: 5 additions & 0 deletions tests/utils/object_store/test_mlflow_object_store.py
@@ -19,6 +19,11 @@
 DEFAULT_PATH = TEST_PATH_FORMAT.format(experiment_id=EXPERIMENT_ID, run_id=RUN_ID)


+@pytest.fixture(autouse=True)
+def setup_mlflow_tracking(monkeypatch):
+    monkeypatch.setenv('MLFLOW_TRACKING_URI', 'databricks')
+
+
 def test_parse_dbfs_path():
     full_artifact_path = DEFAULT_PATH + ARTIFACT_PATH
     assert MLFlowObjectStore.parse_dbfs_path(full_artifact_path) == (EXPERIMENT_ID, RUN_ID, ARTIFACT_PATH)
4 changes: 2 additions & 2 deletions tests/utils/test_dist.py
@@ -63,14 +63,14 @@ def test_busy_wait_for_local_rank_zero(tmp_path):

     dist.barrier()
     start_time = time.time()
-    assert os.listdir(gathered_tmp_path) == []
+    assert os.listdir(gathered_tmp_path) == ['mlruns']
     with dist.busy_wait_for_local_rank_zero(gathered_tmp_path):
         if dist.get_local_rank() == 0:
             time.sleep(0.5)

     end_time = time.time()
     total_time = end_time - start_time
     gathered_times = dist.all_gather_object(total_time)
-    assert os.listdir(gathered_tmp_path) == []
+    assert os.listdir(gathered_tmp_path) == ['mlruns']
     assert len(gathered_times) == 2
     assert abs(gathered_times[0] - gathered_times[1]) < 0.1
