Issues with batch MPI examples #29

Open
vsoch opened this issue May 24, 2023 · 1 comment
vsoch commented May 24, 2023

Hi!

I am trying to reproduce the simple MPI example here, but with an actual MPI program, because the example here just runs hostname. I have two examples locally: one is an application we are working on, and the second is a "hello world" example that I fell back to when I hit some issues (and it reproduced them). Here is what my job looks like:

name: "projects/llnl-flux/locations/us-central1/jobs/hello-world-mpi-005"
uid: "hello-world-mpi-00-3f853428-1bba-44c60"
task_groups {
  name: "projects/xxxxxxxxxxxxxxxlocations/us-central1/jobs/hello-world-mpi-005/taskGroups/group0"
  task_spec {
    runnables {
      barrier {
        name: "wait-for-setup"
      }
    }
    runnables {
      script {
        text: "bash /mnt/share/hello-world-mpi/setup.sh"
      }
    }
    runnables {
      barrier {
        name: "wait-for-setup"
      }
    }
    runnables {
      script {
        text: "bash /mnt/share/hello-world-mpi/run.sh"
      }
    }
    compute_resource {
      cpu_milli: 1000
      memory_mib: 1000
    }
    max_run_duration {
      seconds: 3600
    }
    max_retry_count: 2
    volumes {
      gcs {
        remote_path: "netmark-experiment-bucket"
      }
      mount_path: "/mnt/share"
    }
  }
  task_count: 4
  parallelism: 4
  task_count_per_node: 1
  require_hosts_file: true
  permissive_ssh: true
}
allocation_policy {
  location {
    allowed_locations: "regions/us-central1"
    allowed_locations: "zones/us-central1-a"
    allowed_locations: "zones/us-central1-b"
    allowed_locations: "zones/us-central1-c"
    allowed_locations: "zones/us-central1-f"
  }
  instances {
    policy {
      machine_type: "c2-standard-16"
      boot_disk {
        image: "projects/cloud-hpc-image-public/global/images/family/hpc-centos-7"
      }
    }
  }
  service_account {
    email: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
  labels {
    key: "batch-job-id"
    value: "hello-world-mpi-005"
  }
}
labels {
  key: "type"
  value: "script"
}
labels {
  key: "mount"
  value: "bucket"
}
labels {
  key: "env"
  value: "testing"
}
status {
  state: QUEUED
  run_duration {
  }
}
create_time {
  seconds: 1684889759
  nanos: 883261744
}
update_time {
  seconds: 1684889759
  nanos: 883261744
}
logs_policy {
  destination: CLOUD_LOGGING
}

And here are the setup.sh and run.sh scripts:

setup.sh

#!/bin/bash
export DEBIAN_FRONTEND=noninteractive
sleep $BATCH_TASK_INDEX

# Note that for this family / image, we are root (do not need sudo)
yum update -y && yum install -y cmake gcc tuned ethtool

# This ONLY works on the hpc-* image family images
google_mpi_tuning --nosmt
# google_install_mpi --intel_mpi
google_install_intelmpi --impi_2021

# This is where they are installed to
# ls /opt/intel/mpi/latest/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release

export PATH=/opt/intel/mpi/latest/bin:$PATH

outdir=/mnt/share/hello-world-mpi
mkdir -p ${outdir}
cd ${outdir}

if [ $BATCH_TASK_INDEX = 0 ]; then
    wget -O /tmp/ompi.tar.gz https://docs.it4i.cz/src/ompi/ompi.tar.gz
    cd /tmp
    tar -xzvf ompi.tar.gz
    rm ompi/Makefile
    cp -R ./ompi/* ${outdir}/
    cd ${outdir}/
    ls
    mpicc -g -lmpi -lmpifort hello_c.c -I/opt/intel/mpi/latest/include -I/opt/intel/mpi/2021.8.0/include -L/opt/intel/mpi/2021.8.0/lib/release -L/opt/intel/mpi/2021.8.0/lib -o hello_c
fi

and run.sh

#!/bin/bash
export PATH=/opt/intel/mpi/latest/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
find /opt/intel -name mpicc

if [ $BATCH_TASK_INDEX = 0 ]; then
  cd /mnt/share/hello-world-mpi
  ls
  mpirun -hostfile $BATCH_HOSTS_FILE -n 4 -ppn 1 -- /mnt/share/hello-world-mpi/hello_c
fi

It looks like it's compiling OK - I see hello_c - but the error I've hit in both with mpirun is something related to hydra and an argument?

(screenshot of the mpirun error output)

It's been really challenging figuring out how all this works - e.g., it took me a hot minute to realize that these google install commands for MPI are only available on that specific image family, and then it took 10+ jobs to find the paths / bins of various things (I'm on my 50+ run and still don't have a working thing!) 😆 I have a lot of feedback I'm planning to share, but would like to get at least one reasonable example working first (and I'd be happy to share it)! For my execution I'm using the Python SDK, so I don't have the config beyond what I posted above. Thanks for the help - looking forward to getting this working!
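
For anyone hitting the same "many jobs just to find paths" loop, a small diagnostic preamble near the top of run.sh might surface the relevant locations in one run. This is only a sketch: the function name `report_mpi_env` is hypothetical, and it assumes nothing beyond the `BATCH_HOSTS_FILE` variable and the `mpirun` / `mpicc` tools already used in the scripts above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Batch or Intel MPI): print what this task can see.
report_mpi_env() {
    local tool
    for tool in mpirun mpicc; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: $(command -v "$tool")"
        else
            echo "$tool: not found on PATH"
        fi
    done
    # BATCH_HOSTS_FILE is the variable run.sh passes to mpirun -hostfile.
    # An unset or empty value would leave -hostfile without an argument.
    if [ -n "${BATCH_HOSTS_FILE:-}" ] && [ -s "$BATCH_HOSTS_FILE" ]; then
        echo "hosts file: $BATCH_HOSTS_FILE ($(wc -l < "$BATCH_HOSTS_FILE") lines)"
    else
        echo "hosts file: unset, missing, or empty"
    fi
}

report_mpi_env
```

The output lands in the task's logs, so a single job shows whether the tools are on PATH and whether the hosts file exists before mpirun is attempted.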

vsoch commented Jun 6, 2023

heyo! I got everything working - let me know if you are interested in an example here: https://github.com/converged-computing/operator-experiments/tree/main/google/networking/hello-world-mpi. I think this would be important to show folks - the issue is that the install script only prints a source command for vars.sh (it doesn't actually run it), and I suspect many folks will assume it has already been sourced and run into hours / days of anguish debugging. 😆
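
As a rough illustration of the fix described above, the environment script can be sourced explicitly in both setup.sh and run.sh before calling mpicc or mpirun. This is a sketch under assumptions: the path `/opt/intel/mpi/latest/env/vars.sh` is a guess based on the install prefix listed in setup.sh (use whatever path the install command prints on your image), and `load_mpi_env` and `VARS_SH` are hypothetical names.

```shell
#!/bin/bash
# Assumed location of the Intel MPI environment script; replace it with the
# path that the install command prints on your image.
VARS_SH="${VARS_SH:-/opt/intel/mpi/latest/env/vars.sh}"

# Hypothetical helper: source the given file into the current shell.
load_mpi_env() {
    local vars_file="$1"
    # Drop the positional parameter so the sourced script does not see
    # the file path as one of its own arguments.
    shift
    if [ -f "$vars_file" ]; then
        # shellcheck disable=SC1090
        . "$vars_file"
    else
        echo "vars.sh not found at $vars_file" >&2
        return 1
    fi
}

if load_mpi_env "$VARS_SH"; then
    echo "loaded MPI environment from $VARS_SH"
fi
```

Because the file is sourced rather than executed, any variables it exports stay in the shell that later runs mpirun.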
