Merge branch 'master' into master
uschnoor authored Jun 28, 2018
2 parents f267641 + 7e8f8b0 commit 8cca051
Showing 3 changed files with 4 additions and 4 deletions.
2 changes: 1 addition & 1 deletion publication/Draft_CompSoftwBigScience/include/HTCondor.tex
@@ -1,3 +1,3 @@
-The open-source HTCondor project provides a workload management system which is highly configurable and modular~\cite{HTCondor}. Batch processing workflows can be submitted and are then forwarded by HTCondor to idle resources. HTCondor maintains a resource pool, which worker nodes in a local or remote cluster can join. Once HTCondor has verified the authenticity and features of the newly joined machines, computing jobs are automatically transferred. Special features to connect from within isolated network zones, for example via a NAT-Portal, to the central HTCondor pool are available. The Connection Brokering (CCB) service is especially valuable to connect virtual machines to the central pool. These features and the well-known ability of HTCondor to scale to O(100k) of parallel batch jobs lets us decide to use HTCondor as a workload management system.
+The open-source HTCondor project provides a workload management system which is highly configurable and modular~\cite{HTCondor}. Batch processing workflows can be submitted and are then forwarded by HTCondor to idle resources. HTCondor maintains a resource pool, which worker nodes in a local or remote cluster can join. Once HTCondor has verified the authenticity and features of the newly joined machines, computing jobs are automatically transferred. Special features to connect from within isolated network zones, for example via a NAT-Portal, to the central HTCondor pool are available. The Connection Brokering (CCB) service is especially valuable to connect virtual machines to the central pool. These features and the well-known ability of HTCondor to scale to O(100k) of parallel batch jobs makes HTCondor well suited as a workload management system for the use cases described in this paper.

The virtual machines spawned for the CMS user group of the KIT come with the HTCondor client (\texttt{startd}) pre-installed and this client is started after the machine has fully booted and connects to the central HTCondor pool at the KIT via a shared secret. Due to HTCondor's dynamic design, new machines in the pool will automatically receive jobs and the transfer of the job configuration and meta-data files is handled via HTCondor's internal file transfer systems.
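For illustration, the listing below is a minimal sketch of how such a pool can be inspected with the HTCondor Python bindings; the collector hostname is a placeholder and does not refer to the actual KIT pool.

    # Minimal sketch using the HTCondor Python bindings (package "htcondor").
    # The collector hostname below is a placeholder, not the real KIT pool address.
    import htcondor

    # Contact the collector of the central pool.
    collector = htcondor.Collector("central-pool.example.org")

    # Query the startd ads, i.e. the slots advertised by worker nodes and
    # virtual machines that have joined the pool.
    ads = collector.query(
        htcondor.AdTypes.Startd,
        projection=["Machine", "State", "Activity"],
    )

    for ad in ads:
        print(ad.get("Machine"), ad.get("State"), ad.get("Activity"))

A newly booted virtual machine appears in such a listing once its startd has authenticated against the pool, which is the behaviour the paragraph above relies on.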
@@ -23,7 +23,7 @@ \subsection{Computing at the University of Freiburg}
utilization of the installed hardware are necessary.
Transferring expertise from the operation of the established local private cloud,
the use of OpenStack as a cloud platform has been identified as a
-suitable solution for the bwForCluster NEMO to provide a more flexible software
+suitable solution for NEMO to provide a more flexible software
deployment in addition to the existing software module system.
This produced a couple of
implications ranging from challenges in the automated creation of suitable
@@ -42,7 +42,7 @@ \subsection{Separation of software environments}
applications and configurations can be controlled autonomously by the research groups.

To allow more flexible software environments the standard bare metal
-operation of the bwForCluster NEMO is extended with a parallel installation of OpenStack
+operation of NEMO is extended with a parallel installation of OpenStack
components~\cite{hpc-symp:2016}.
The NEMO cluster uses Adaptive's Moab Workload Manager~\cite{Moab} as a
scheduler of compute jobs.
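As a rough illustration of what driving such a parallel OpenStack layer looks like from the client side, the sketch below uses the openstacksdk library; the cloud name, image, flavor and network identifiers are placeholders and do not describe the actual NEMO deployment.

    # Minimal sketch using openstacksdk; all names and IDs are placeholders
    # and do not reflect the actual NEMO OpenStack configuration.
    import openstack

    # Credentials are read from a clouds.yaml entry named "nemo" (hypothetical).
    conn = openstack.connect(cloud="nemo")

    # Boot one virtual machine from a pre-built research-group image.
    server = conn.compute.create_server(
        name="worker-001",
        image_id="IMAGE_UUID",                 # placeholder
        flavor_id="FLAVOR_UUID",               # placeholder
        networks=[{"uuid": "NETWORK_UUID"}],   # placeholder
    )

    # Wait until the machine is active, then report its status.
    server = conn.compute.wait_for_server(server)
    print(server.name, server.status)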
@@ -162,7 +162,7 @@ \subsection{ROCED}

%$\to$ CMS Karlsruhe
Many capable batch systems exist today and they can be interfaced to virtualization providers using the cloud meta-scheduler ROCED (Responsive On-demand Cloud Enabled Deployment) which has been developed at the KIT since 2010~\cite{ROCED}. ROCED is written in a modular
-fashion and the interfaces to batch systems and cloud sites are implemented as so-called \textit{Adapters}. This makes ROCED independent of a specific user group or workflow. It provides a scheduling core which collects the current requirement of computing resources and decides if virtual machines need to be started or can be stopped. One or more Requirement Adapters report the current queue status of batch systems to the central scheduling core. Currently, Requirement Adapters are implemented for the Slurm, Torque, HTCondor and GridEngine batch systems. The Site Adapters allow ROCED to start, stop and monitor virtual machines on multiple cloud sites. Implementations exist for Amazon EC2, OpenStack, OpenNebula and Maob-based virtualization at HPC centers. Special care has been put into the resilience of ROCED: it can automatically terminate non-responsive machines and restart virtual machines in case some machines dropped out. This allows VM setups orchestrated by ROCED with thousands of virtual machines and many tens of thousands of jobs to run in production environments.
+fashion in python and the interfaces to batch systems and cloud sites are implemented as so-called \textit{Adapters}. This makes ROCED independent of a specific user group or workflow. It provides a scheduling core which collects the current requirement of computing resources and decides if virtual machines need to be started or can be stopped. One or more Requirement Adapters report the current queue status of batch systems to the central scheduling core. Currently, Requirement Adapters are implemented for the Slurm, Torque, HTCondor and GridEngine batch systems. The Site Adapters allow ROCED to start, stop and monitor virtual machines on multiple cloud sites. Implementations exist for Amazon EC2, OpenStack, OpenNebula and Moab-based virtualization at HPC centers. Special care has been put into the resilience of ROCED: it can automatically terminate non-responsive machines and restart virtual machines in case some machines dropped out. This allows VM setups orchestrated by ROCED with thousands of virtual machines and many tens of thousands of jobs to run in production environments.
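To make the adapter structure more concrete, the following is a deliberately simplified, hypothetical sketch of the scheduling idea in Python; the class and method names are invented for illustration and are not ROCED's actual interfaces.

    # Hypothetical sketch of an adapter-based scaling core; names are invented
    # for illustration and are NOT ROCED's real classes or interfaces.
    class RequirementAdapter:
        """Reports how many jobs are waiting in one batch system queue."""
        def queued_jobs(self) -> int:
            raise NotImplementedError

    class SiteAdapter:
        """Starts, stops and counts virtual machines on one cloud site."""
        def running_machines(self) -> int:
            raise NotImplementedError
        def start_machines(self, count: int) -> None:
            raise NotImplementedError
        def stop_machines(self, count: int) -> None:
            raise NotImplementedError

    def scheduling_cycle(requirements, site, jobs_per_machine=4):
        """One pass of the core: compare queued jobs with running capacity."""
        demand = sum(r.queued_jobs() for r in requirements)
        needed = -(-demand // jobs_per_machine)      # ceiling division
        running = site.running_machines()
        if needed > running:
            site.start_machines(needed - running)    # scale up
        elif needed < running:
            site.stop_machines(running - needed)     # scale down

A production meta-scheduler additionally has to cope with machines that are still booting or draining, with site quotas and with non-responsive instances, which this sketch leaves out.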


\subsection{Using HTCondor as front-end scheduler}\label{sec:ROCED:HTCondor}