Commit

Remove all references to LSF (#780)
Remove all references to LSF and LSB components, including
`JsrunSettings`, `BsubBatchSettings`, and so on.

[ committed by @al-rigazzi ]
al-rigazzi authored Dec 10, 2024
1 parent 4701e8c commit f509ad6
Showing 45 changed files with 56 additions and 2,706 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/run_tests.yml
@@ -55,7 +55,7 @@ jobs:
fail-fast: false
matrix:
subset: [backends, slow_tests, group_a, group_b]
os: [macos-12, macos-14, ubuntu-22.04] # Operating systems
os: [macos-14, ubuntu-22.04] # Operating systems
compiler: [8] # GNU compiler version
rai: [1.2.7] # Redis AI versions
py_v: ["3.9", "3.10", "3.11"] # Python versions
4 changes: 2 additions & 2 deletions .wci.yml
@@ -10,7 +10,7 @@
Machine Learning (ML) libraries, like PyTorch and TensorFlow,
in combination with High Performance Computing (HPC) simulations and applications.
SmartSim launches ML infrastructure on HPC systems alongside user workloads
and supports most HPC workload managers (e.g. Slurm, PBSPro, LSF).
and supports most HPC workload managers (e.g. Slurm, PBSPro, SGE).
SmartSim also provides a set of client libraries in Python, C++, C, and Fortran.
These client libraries allow users to send and receive data between user
applications and the machine learning infrastructure. Moreover, the
@@ -40,7 +40,7 @@
resource_managers:
- Slurm
- PBSPro
- LSF
- SGE
- Linux/MacOS
transfer_protocols:
- TCP/IP
12 changes: 3 additions & 9 deletions README.md
@@ -144,7 +144,6 @@ SmartSim](https://www.craylabs.org/docs/api/smartsim_api.html#settings).
- ``MpirunSettings``
- ``SrunSettings``
- ``AprunSettings``
- ``JsrunSettings``

The following example launches a hello world MPI program using the local launcher
for single compute node, workstations and laptops.
@@ -177,7 +176,7 @@ SmartSim integrates with common HPC schedulers providing batch and interactive
launch capabilities for all applications:

- Slurm
- LSF
- SGE
- PBSPro
- Local (for laptops/single node, no batch)

@@ -197,11 +196,9 @@ salloc -N 3 --ntasks-per-node=20 --ntasks 60 --exclusive -t 00:10:00
# get interactive allocation (PBS)
qsub -l select=3:ncpus=20 -l walltime=00:10:00 -l place=scatter -I -q <queue>

# get interactive allocation (LSF)
bsub -Is -W 00:10 -nnodes 3 -P <project> $SHELL
```

This same script will run on a SLURM, PBS, or LSF system as the ``launcher``
This same script will run on a SLURM, PBS, or SGE system as the ``launcher``
is set to `auto` in the [Experiment](https://www.craylabs.org/docs/api/smartsim_api.html#experiment)
initialization. The run command like ``mpirun``,
``aprun`` or ``srun`` will be automatically detected from what is available on the
@@ -281,7 +278,7 @@ python hello_ensemble.py
```

Similar to the interactive example, this same script will run on a SLURM, PBS,
or LSF system as the ``launcher`` is set to `auto` in the
or SGE system as the ``launcher`` is set to `auto` in the
[Experiment](https://www.craylabs.org/docs/api/smartsim_api.html#experiment)
initialization. Local launching does not support batch workloads.

@@ -343,9 +340,6 @@ salloc -N 3 --ntasks-per-node=1 --exclusive -t 00:10:00
# get interactive allocation (PBS)
qsub -l select=3:ncpus=1 -l walltime=00:10:00 -l place=scatter -I -q queue

# get interactive allocation (LSF)
bsub -Is -W 00:10 -nnodes 3 -P project $SHELL

```

```python
17 changes: 2 additions & 15 deletions conftest.py
@@ -62,7 +62,6 @@
from smartsim.settings import (
AprunSettings,
DragonRunSettings,
JsrunSettings,
MpiexecSettings,
MpirunSettings,
PalsMpiexecSettings,
@@ -120,7 +119,7 @@ def print_test_configuration() -> None:

def pytest_configure() -> None:
pytest.test_launcher = test_launcher
pytest.wlm_options = ["slurm", "pbs", "lsf", "pals", "dragon", "sge"]
pytest.wlm_options = ["slurm", "pbs", "pals", "dragon", "sge"]
account = get_account()
pytest.test_account = account
pytest.test_device = test_device
@@ -386,15 +385,10 @@ def get_base_run_settings(
run_args = {"--np": ntasks, "--hostfile": host_file}
run_args.update(kwargs)
return RunSettings(exe, args, run_command="mpiexec", run_args=run_args)
if test_launcher == "lsf":
run_args = {"--np": ntasks, "--nrs": nodes}
run_args.update(kwargs)
settings = RunSettings(exe, args, run_command="jsrun", run_args=run_args)
return settings
if test_launcher != "local":
raise SSConfigError(
"Base run settings are available for Slurm, PBS, "
f"and LSF, but launcher was {test_launcher}"
f"and Dragon, but launcher was {test_launcher}"
)
# TODO allow user to pick aprun vs MPIrun
return RunSettings(exe, args)
@@ -429,13 +423,6 @@ def get_run_settings(
run_args = {"np": ntasks, "hostfile": host_file}
run_args.update(kwargs)
return PalsMpiexecSettings(exe, args, run_args=run_args)
if test_launcher == "lsf":
run_args = {
"nrs": nodes,
"tasks_per_rs": max(ntasks // nodes, 1),
}
run_args.update(kwargs)
return JsrunSettings(exe, args, run_args=run_args)

return RunSettings(exe, args)

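Note on the `conftest.py` hunks above: with the `lsf` branch gone, a test run that still sets the launcher to `lsf` falls through to the `!= "local"` guard and raises. The snippet below is a minimal sketch of that control flow only. The `SSConfigError` class here is a local stand-in, `HANDLED` is an assumption inferred from the new error message, and the returned strings are placeholders rather than real settings objects.

```python
class SSConfigError(Exception):
    """Local stand-in for SmartSim's configuration error (illustrative only)."""


# Assumption: launchers that have their own branch before the fallback guard,
# inferred from the wording of the updated error message.
HANDLED = ("slurm", "pbs", "dragon")


def base_run_settings_kind(test_launcher: str) -> str:
    """Mimic the tail of get_base_run_settings once the LSF branch is removed."""
    if test_launcher in HANDLED:
        # Placeholder for the launcher-specific RunSettings construction.
        return f"{test_launcher}-specific settings"
    if test_launcher != "local":
        # "lsf" now lands here, like any other unrecognized launcher.
        raise SSConfigError(
            "Base run settings are available for Slurm, PBS, "
            f"and Dragon, but launcher was {test_launcher}"
        )
    # Local launcher: plain RunSettings with no run command.
    return "plain settings"
```

In practice this means a test environment that still sets the launcher to `lsf` should fail fast with a configuration error instead of silently building `jsrun` settings.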
64 changes: 0 additions & 64 deletions doc/api/smartsim_api.rst
@@ -59,11 +59,9 @@ Types of Settings:
MpirunSettings
MpiexecSettings
OrterunSettings
JsrunSettings
DragonRunSettings
SbatchSettings
QsubBatchSettings
BsubBatchSettings

Settings objects can accept a container object that defines a container
runtime, image, and arguments to use for the workload. Below is a list of
@@ -187,41 +185,6 @@ for Slurm and PBS sessions, respectively).
:members:


.. _jsrun_api:

JsrunSettings
-------------


``JsrunSettings`` can be used on any system that supports the
IBM LSF launcher.

``JsrunSettings`` can be used in interactive session (on allocation)
and within batch launches (i.e. ``BsubBatchSettings``)


.. autosummary::

JsrunSettings.set_num_rs
JsrunSettings.set_cpus_per_rs
JsrunSettings.set_gpus_per_rs
JsrunSettings.set_rs_per_host
JsrunSettings.set_tasks
JsrunSettings.set_tasks_per_rs
JsrunSettings.set_binding
JsrunSettings.make_mpmd
JsrunSettings.set_mpmd_preamble
JsrunSettings.update_env
JsrunSettings.set_erf_sets
JsrunSettings.format_env_vars
JsrunSettings.format_run_args


.. autoclass:: JsrunSettings
:inherited-members:
:undoc-members:
:members:

.. _openmpi_run_api:

MpirunSettings
@@ -361,33 +324,6 @@ be launched as a batch on PBSPro systems.
:members:


.. _bsub_api:

BsubBatchSettings
-----------------


``BsubBatchSettings`` are used to configure jobs that should
be launched as a batch on LSF systems.


.. autosummary::

BsubBatchSettings.set_walltime
BsubBatchSettings.set_smts
BsubBatchSettings.set_project
BsubBatchSettings.set_nodes
BsubBatchSettings.set_expert_mode_req
BsubBatchSettings.set_hostlist
BsubBatchSettings.set_tasks
BsubBatchSettings.format_batch_args


.. autoclass:: BsubBatchSettings
:inherited-members:
:undoc-members:
:members:

.. _singularity_api:

Singularity
27 changes: 0 additions & 27 deletions doc/batch_settings.rst
@@ -16,8 +16,6 @@ launching capabilities tailored for specific workload managers (WLMs). Each Smar
- :ref:`SbatchSettings<sbatch_api>`
- The PBS Pro `launcher` supports:
- :ref:`QsubBatchSettings<qsub_api>`
- The LSF `launcher` supports:
- :ref:`BsubBatchSettings<bsub_api>`

.. note::
The local `launcher` does not support batch jobs.
@@ -97,31 +95,6 @@ Below are examples of how to initialize a ``BatchSettings`` object per `launcher
If `launcher="auto"`, SmartSim will detect that the ``Experiment`` is running on a PBS Pro based
machine and set the launcher to `"pbs"`.

.. group-tab:: LSF
To instantiate the ``BsubBatchSettings`` object, which interfaces with the LSF job scheduler, specify
`launcher="lsf"` when initializing the ``Experiment``. Upon calling ``create_batch_settings``,
SmartSim will detect the job scheduler and return the appropriate batch settings object.

.. code-block:: python
from smartsim import Experiment
# Initialize the experiment and provide launcher LSF
exp = Experiment("name-of-experiment", launcher="lsf")
# Initialize a BsubBatchSettings object
bsub_batch_settings = exp.create_batch_settings(nodes=1, time="10:00:00", batch_args={"ntasks": 1})
# Set the account for the lsf batch job
bsub_batch_settings.set_account("12345-Cray")
# Set the partition for the lsf batch job
bsub_batch_settings.set_queue("default")
The initialized ``BsubBatchSettings`` instance can now be passed to a SmartSim entity
(``Model`` or ``Ensemble``) via the `batch_settings` argument in ``create_batch_settings``.

.. note::
If `launcher="auto"`, SmartSim will detect that the ``Experiment`` is running on a LSF based
machine and set the launcher to `"lsf"`.

.. warning::
Note that initialization values provided (e.g., `nodes`, `time`, etc) will overwrite the same arguments in `batch_args` if present.
18 changes: 12 additions & 6 deletions doc/changelog.md
@@ -13,25 +13,31 @@ To be released at some point in the future

Description

- Terminate LSF and LSB support
- Implement workaround for Tensorflow that allows RedisAI to build with GCC-14
- Add instructions for installing SmartSim on PML's Scylla
- Fix typos in documentation

Detailed Notes

- After the supercomputer Summit was decommissioned, a decision was made to
terminate SmartSim's support of the LSF launcher and LSB scheduler. If
this impacts your work, please contact us.
([SmartSim-PR780](https://github.com/CrayLabs/SmartSim/pull/780))
- Fix typos in the `train_surrogate` tutorial documentation.
([SmartSim-PR758](https://github.com/CrayLabs/SmartSim/pull/758))
- PML's Scylla is still under development. The usual SmartSim
build instructions do not apply because the GPU dependencies
have yet to be installed at a system-wide level. Scylla has
its own entry in the documentation.
([SmartSim-PR733](https://github.com/CrayLabs/SmartSim/pull/733))
- In libtensorflow, the input argument to TF_SessionRun seems to be mistyped to
TF_Output instead of TF_Input. These two types differ only in name. GCC-14
catches this and throws an error, even though earlier versions allow this. To
solve this problem, patches are applied to the Tensorflow backend in RedisAI.
Future versions of Tensorflow may fix this problem, but for now this seems to be
the best workaround.
([SmartSim-PR738](https://github.com/CrayLabs/SmartSim/pull/738))
- PML's Scylla is still under development. The usual SmartSim
build instructions do not apply because the GPU dependencies
have yet to be installed at a system-wide level. Scylla has
its own entry in the documentation.
([SmartSim-PR733](https://github.com/CrayLabs/SmartSim/pull/733))
- Fix typos in the `train_surrogate` tutorial documentation


### 0.8.0
7 changes: 2 additions & 5 deletions doc/developer.rst
@@ -90,7 +90,7 @@ If any of the above commands are used, the test suite will run the "light" test
suite by default.


PBSPro, Slurm, LSF
PBSPro, Slurm, SGE
==================

To run the full test suite, users will have to be on a system with one of the
@@ -105,17 +105,14 @@ of at least 3 nodes.
# for PBSPro (with aprun)
qsub -l select=3 -l place=scatter -l walltime=00:10:00 -q queue
# for LSF (with jsrun)
bsub -Is -W 00:30 -nnodes 3 -P project $SHELL
Values for queue, account, or project should be substituted appropriately.

Once in an iterative allocation, users will need to set the test launcher
environment variable: ``SMARTSIM_TEST_LAUNCHER`` to one of the following values

- slurm
- pbs
- lsf
- sge
- local

If tests have to run on an account or project, the environment variable
6 changes: 3 additions & 3 deletions doc/experiment.rst
@@ -7,7 +7,7 @@ Overview
SmartSim helps automate the deployment of AI-enabled workflows on HPC systems. With SmartSim, users
can describe and launch combinations of applications and AI/ML infrastructure to produce novel and
scalable workflows. SmartSim supports launching these workflows on a diverse set of systems, including
local environments such as Mac or Linux, as well as HPC job schedulers (e.g. Slurm, PBS Pro, and LSF).
local environments such as Mac or Linux, as well as HPC job schedulers (e.g. Slurm, PBS Pro, and SGE).

The ``Experiment`` API is SmartSim's top level API that provides users with methods for creating, combining,
configuring, launching and monitoring :ref:`entities<entities_exp_docs>` in an AI-enabled workflow. More specifically, the
@@ -49,7 +49,7 @@ workflow in the :ref:`Example<exp_example>` section of this page.
Launchers
=========
SmartSim supports launching AI-enabled workflows on a wide variety of systems, including locally on a Mac or
Linux machine or on HPC machines with a job scheduler (e.g. Slurm, PBS Pro, and LSF). When creating a SmartSim
Linux machine or on HPC machines with a job scheduler (e.g. Slurm, PBS Pro, and SGE). When creating a SmartSim
``Experiment``, the user has the opportunity to specify the `launcher` type or defer to automatic `launcher` selection.
`Launcher` selection determines how SmartSim translates entity configurations into system calls to launch,
manage, and monitor. Currently, SmartSim supports 7 `launcher` options:
@@ -58,7 +58,7 @@ manage, and monitor. Currently, SmartSim supports 7 `launcher` options:
2. ``slurm``: for systems using the Slurm scheduler
3. ``pbs``: for systems using the PBS Pro scheduler
4. ``pals``: for systems using the PALS scheduler
5. ``lsf``: for systems using the LSF scheduler
5. ``sge``: for systems using the SGE scheduler
6. ``dragon``: if Dragon is installed in the current Python environment, see :ref:`Dragon Install <dragon_install>`
7. ``auto``: have SmartSim auto-detect the launcher to use (will not detect ``dragon``)
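Note on the launcher list above: after this commit the count stays at seven because `sge` takes the slot that `lsf` held. A script that still passes `launcher="lsf"` could be caught early with a check along the following lines. This is a sketch under assumptions: `check_launcher` is a hypothetical helper, not part of SmartSim, and the first list entry (not shown in the hunk) is taken to be `local`, as in the `doc/developer.rst` list.

```python
# The seven launcher options listed after this commit; "local" is assumed
# to be item 1, which the hunk does not show.
SUPPORTED_LAUNCHERS = ("local", "slurm", "pbs", "pals", "sge", "dragon", "auto")

# Launchers that used to be accepted, mapped to the change that removed them.
REMOVED_LAUNCHERS = {"lsf": "SmartSim-PR780"}


def check_launcher(name: str) -> str:
    """Return the normalized launcher name or raise ValueError with a hint."""
    key = name.strip().lower()
    if key in REMOVED_LAUNCHERS:
        raise ValueError(
            f"launcher '{key}' was removed in {REMOVED_LAUNCHERS[key]}; "
            f"choose one of {', '.join(SUPPORTED_LAUNCHERS)}"
        )
    if key not in SUPPORTED_LAUNCHERS:
        raise ValueError(f"unknown launcher '{key}'")
    return key
```

Keeping removed names in a separate mapping lets the error message point at the change that dropped them, which is friendlier than a generic "unknown launcher".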

2 changes: 1 addition & 1 deletion doc/overview.rst
@@ -61,7 +61,7 @@ The key features of the IL are:
- An API to start, monitor, and stop HPC jobs from Python or from a Jupyter notebook.
- Automated deployment of in-memory data staging (`Redis <https://redis.io>`_) and computational
storage (`RedisAI <https://redisai.io>`_).
- Programmatic launches of batch and in-allocation jobs on PBS, Slurm, and LSF systems.
- Programmatic launches of batch and in-allocation jobs on PBS, Slurm, and SGE systems.
- Creating and configuring ensembles of workloads with isolated communication channels.

The IL can configure and launch batch jobs as well as jobs within interactive