From f60233b00f823f45635b873a5f4e12c15827dc08 Mon Sep 17 00:00:00 2001
From: Ludovic Raess
Date: Sun, 13 Oct 2024 00:42:04 +0200
Subject: [PATCH] Improve doc

---
 docs/make.jl                                  |  2 +-
 docs/src/concepts/distributed.md              |  6 ++--
 docs/src/getting_started.md                   |  2 +-
 ..._mpi_support.md => using_chmy_with_mpi.md} | 35 +++++++++++--------
 examples/diffusion_2d_mpi.jl                  |  2 +-
 5 files changed, 27 insertions(+), 20 deletions(-)
 rename docs/src/{using_chmy_with_mpi_support.md => using_chmy_with_mpi.md} (50%)

diff --git a/docs/make.jl b/docs/make.jl
index 43cba5fb..52db8d94 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -16,7 +16,7 @@ makedocs(
     pages = Any[
         "Home" => "index.md",
         "Getting Started with Chmy.jl" => "getting_started.md",
-        "Using Chmy.jl with MPI Support" => "using_chmy_with_mpi_support.md",
+        "Using Chmy.jl with MPI" => "using_chmy_with_mpi.md",
         "Concepts" => Any["concepts/architectures.md",
                           "concepts/grids.md",
                           "concepts/fields.md",
diff --git a/docs/src/concepts/distributed.md b/docs/src/concepts/distributed.md
index d7220f53..e2e5f463 100644
--- a/docs/src/concepts/distributed.md
+++ b/docs/src/concepts/distributed.md
@@ -1,12 +1,12 @@
 # Distributed
 
-**Task-based parallelism** in [Chmy.jl](https://github.com/PTsolvers/Chmy.jl) is featured by the usage of [`Threads.@spawn`](https://docs.julialang.org/en/v1/base/multi-threading/#Base.Threads.@spawn), with an additional layer of a [Worker](../developer_documentation/workers.md) construct for efficiently managing the lifespan of tasks. Note that the task-based parallelism provides a high-level abstraction of program execution **not only** for **shared-memory architecture** on a single machine, but it can be also extended to **hybrid parallelism**, consisting of both shared and distributed-memory parallelism. The `Distributed` module in Chmy.jl allows users to leverage the hybrid parallelism through the power of abstraction.
+**Task-based parallelism** in [Chmy.jl](https://github.com/PTsolvers/Chmy.jl) is built on [`Threads.@spawn`](https://docs.julialang.org/en/v1/base/multi-threading/#Base.Threads.@spawn), with an additional [Worker](../developer_documentation/workers.md) construct layered on top for efficiently managing the lifespan of tasks. Note that task-based parallelism provides a high-level abstraction of program execution **not only** for **shared-memory architectures** on a single device, but can also be extended to **hybrid parallelism**, combining shared- and distributed-memory parallelism. The `Distributed` module in Chmy.jl allows users to leverage this hybrid parallelism through the power of abstraction.
 
 We will start with some basic background knowledge for understanding the architecture of modern HPC clusters, the underlying memory model and the programming paradigm complied with it.
 
 ## HPC Cluster & Distributed Memory
 
-An **high-performance computing (HPC)** cluster consists of a **network** of independent computers combined into a system through specialized hardware. We call each computer a *node*, and each node manages its own private memory. Such system with interconnected nodes, without having access to memory of any other node, features the **distributed memory model**. The underlying fast interconnect architecture (e.g. *InfiniBand*) that physically connects the nodes in the **network** can transfer the data from one node to another in an extremely efficient manner.
+A **high-performance computing (HPC)** cluster consists of a **network** of independent computers combined into a system through specialised hardware. We call each computer a *node*, and each node manages its own private memory. Such a system of interconnected nodes, in which no node has access to the memory of any other node, follows the **distributed memory model**. The underlying fast interconnect architecture (e.g. *InfiniBand*) that physically connects the nodes in the **network** can transfer data from one node to another in an extremely efficient manner.
 
 ```@raw html
@@ -14,7 +14,7 @@ An **high-performance computing (HPC)** cluster consists of a **network** of ind
 ```
 
-By using the fast interconnection, processes across different nodes can communicate with each other through the sending of messages in a high-throughput, low-latency fashion. The syntax and semantics of how **message passing** should proceed through such network is defined by a standard called the **Message-Passing Interface (MPI)**, and there are different libraries that implement the standard, resulting in a wide range of choice (MPICH, Open MPI, MVAPICH etc.) for users. [MPI.jl](https://github.com/JuliaParallel/MPI.jl) package provides a high-level API for Julia users to call library routines of an implementation of user's choice.
+By using the fast interconnect, processes on different nodes can communicate with each other through the exchange of messages in a high-throughput, low-latency fashion. The syntax and semantics of how **message passing** should proceed through such a network are defined by a standard called the **Message-Passing Interface (MPI)**, and different libraries implement this standard, giving users a wide range of choices (MPICH, Open MPI, MVAPICH, etc.). The [MPI.jl](https://github.com/JuliaParallel/MPI.jl) package provides a high-level API for Julia users to call library routines of the implementation of their choice.
 
 !!! info "Message-Passing Interface (MPI) is a General Specification"
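To make the message-passing model concrete, below is a minimal standalone MPI.jl sketch (not part of this patch) that queries the rank and size of the default communicator and sums the ranks across all processes. The file name and process count are placeholders; the snippet can be launched with the `mpiexecjl` wrapper shipped with MPI.jl.

```julia
# hello_mpi.jl -- run with e.g. `mpiexecjl -n 4 julia hello_mpi.jl` (placeholder file name)
using MPI

MPI.Init()

comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)   # unique 0-based ID of this process
nprocs = MPI.Comm_size(comm)   # total number of processes

println("Hello from rank $rank out of $nprocs processes")

# Collective communication: every process contributes its rank, all receive the sum.
total = MPI.Allreduce(rank, +, comm)
rank == 0 && println("Sum of all ranks: $total")

MPI.Finalize()
```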
diff --git a/docs/src/getting_started.md b/docs/src/getting_started.md
index 45de62a5..2fab7841 100644
--- a/docs/src/getting_started.md
+++ b/docs/src/getting_started.md
@@ -1,6 +1,6 @@
 # Getting Started with Chmy.jl
 
-[Chmy.jl](https://github.com/PTsolvers/Chmy.jl) is a backend-agnostic toolkit for finite difference computations on multi-dimensional computational staggered grids. In this introductory tutorial, we will showcase the essence of Chmy.jl by solving a simple 2D diffusion problem. The full code of the tutorial material is available under [diffusion_2d.jl](https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d.jl).
+[Chmy.jl](https://github.com/PTsolvers/Chmy.jl) is a backend-agnostic toolkit for finite difference computations on multi-dimensional staggered grids. In this introductory tutorial, we showcase the essence of Chmy.jl by solving a simple 2D diffusion problem. The full code of the tutorial material is available under [diffusion_2d.jl](https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d.jl).
 
 ## Basic Diffusion
diff --git a/docs/src/using_chmy_with_mpi_support.md b/docs/src/using_chmy_with_mpi.md
similarity index 50%
rename from docs/src/using_chmy_with_mpi_support.md
rename to docs/src/using_chmy_with_mpi.md
index 205046ad..a33360aa 100644
--- a/docs/src/using_chmy_with_mpi_support.md
+++ b/docs/src/using_chmy_with_mpi.md
@@ -1,9 +1,9 @@
 # Using Chmy.jl with MPI Support
 
-In this tutorial, we are now going to dive into the `Distributed` module in [Chmy.jl](https://github.com/PTsolvers/Chmy.jl) and learn how to run our code on multiple nodes in a typical HPC cluster setup. We start from the [diffusion_2d.jl](https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d.jl) code that we created in the previous tutorial section [Getting Started with Chmy.jl](./getting_started.md).
+In this tutorial, we dive into the `Distributed` module in [Chmy.jl](https://github.com/PTsolvers/Chmy.jl) and learn how to run our code on multiple processes in a typical HPC cluster setup.
+We start from the [diffusion_2d.jl](https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d.jl) code from the tutorial section [Getting Started with Chmy.jl](./getting_started.md).
 
 !!! warning "Experience with HPC Clusters Assumed"
-    In this tutorial, we assume our users have already worked with HPC clusters and are rather familiar with the basic concepts of distributed computing. If you find anything conceptually difficult to start with, have a look at our concept documentation on the [Distributed](./concepts/distributed.md) module.
+    In this tutorial, we assume users are familiar with HPC clusters and the basic concepts of distributed computing. If you find anything conceptually difficult to start with, have a look at the concept documentation on the [Distributed](./concepts/distributed.md) module.
 
 We need to make the following changes to our code to enable MPI support, in which we:
@@ -17,7 +17,7 @@ We need to make the following changes to our code to enable MPI support, in whic
 5. finalise MPI Environment
 
-## Initialise MPI Environment & Specify Distributed Architecture
+## Initialise MPI & Specify Distributed Architecture
 
 The first step is to load the `MPI.jl` module and initialise MPI with `MPI.Init()` at the beginning of the program.
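As a minimal sketch of this first step (assuming MPI.jl is installed in the active environment; the `MPI.Initialized()` guard is an optional convenience that is not part of the tutorial code):

```julia
using MPI

# Initialise MPI exactly once at program start; the guard avoids re-initialising
# when the script is re-included in a session where MPI is already set up.
MPI.Initialized() || MPI.Init()
```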
@@ -39,7 +39,7 @@ The `global_rank()` function provides a convenient method for users to retrieve
 
 ## Redefine Geometry
 
-In the original single-node setup, we defined a global grid that covered the entire computational domain. 
+In the original single-node setup, we defined a global grid that covered the entire computational domain.
 
 ```julia
 @views function main(backend=CPU(); nxy=(126, 126))
@@ -53,34 +53,38 @@ end
 main(; nxy=(128, 128) .- 2)
 ```
 
-This approach worked well when all computations were performed on a single machine. However, in a distributed environment, we need to redefine the grid to accommodate multiple MPI processes, ensuring each process handles a portion of the overall domain. To adjust our geometry to the distributed environment, we need to redefine the grid dimensions to accommodate all MPI processes. The modified version creates a local grid for each MPI process and adjusts the global grid dimensions based on the MPI topology:
+In Chmy.jl, the grid constructor (here `UniformGrid`) always takes the dimensions `dims` of the global grid as input argument and returns the local `grid` object. For a single-device architecture (no MPI), the local `grid` is equivalent to the global `grid`, since a single process performs the computations on the entire domain. For a distributed architecture, `dims` still specifies the global grid dimensions, but the constructor returns the local portion of the grid corresponding to each MPI rank.
+
+Following a GPU-centric approach, we want to control the local dimensions of the grid, `nxy_l` hereafter, to ensure optimal execution on a single GPU. We thus need to reconstruct the global grid dimensions `dims_g` from the MPI topology and the local grid dimensions `dims_l`:
 
 ```julia
 @views function main(backend=CPU(); nxy_l=(126, 126))
     # After: geometry
     dims_l = nxy_l
-    dims_g = dims_l .* dims(topo); nx, ny = dims_g
+    dims_g = dims_l .* dims(topo)
+    nx, ny = dims_g
     grid = UniformGrid(arch; origin=(-2, -2), extent=(4, 4), dims=dims_g)
     launch = Launcher(arch, grid, outer_width=(16, 8))
-    
+
     # ...
 end
 
 main(; nxy_l=(128, 128) .- 2)
 ```
 
-Here, `dims_g` represents the global dimensions of the grid, which are obtained by multiplying the local grid dimensions `dims_l` by the MPI topology dimensions. The `outer_width` parameter specifies the number of ghost cells or padding layers that are added to each local grid. These ghost cells are used to handle boundary conditions between neighboring processes, ensuring that data is exchanged correctly during computation.
+Here, `dims_g` represents the global dimensions of the grid, obtained by multiplying the local grid dimensions `dims_l` by the MPI topology dimensions. The `outer_width` parameter specifies the number of grid points that constitute the boundary region of each local grid. This approach allows the boundary-condition computations (including MPI communication) to be performed asynchronously and overlapped with the computations on the inner points of each local domain, thereby hiding the MPI communication latency.
+
 
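As a toy illustration of this sizing logic (plain Julia, not part of the patch; the 2x2 process topology and the 126x126 local dimensions are made-up numbers), the global grid is simply the element-wise product of the local dimensions and the process-topology dimensions:

```julia
# Toy example of the GPU-centric sizing: each rank owns a fixed local grid,
# and the global grid is the element-wise product of the local dimensions
# with the process-topology dimensions (no Chmy or MPI calls involved).
dims_l    = (126, 126)          # local (per-process) grid dimensions
topo_dims = (2, 2)              # hypothetical 2x2 MPI process topology
dims_g    = dims_l .* topo_dims # global grid dimensions
@assert dims_g == (252, 252)
println("global grid: ", dims_g)
```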
 ## Avoid Redundant I/O Operations
 
-Previously, having the view of a single machine in mind, we can simply print out any information during the code execution, whether it is the value of some physical properties that we want to monitor about or the current number of iterations during the simulation.
+Previously, with a single process in mind, we could simply print out any information during the code execution, whether it is the value of some physical property that we want to monitor or the current iteration count of the simulation.
 
 ```julia
 # Before: prints out current no. iteration on a single node
 @printf("it = %d/%d \n", it, nt)
 ```
 
-In a distributed setup, on the other hand, all MPI processes would have displayed the same line with this statement. In order to prevent this redundancy, we utilize the unique process ID to determine whether the process that is currently running is the one that we have assigned to handle the I/O task.
+In a distributed setup, on the other hand, all MPI processes execute the same code and would each print the same line with this statement. To prevent this redundancy, we use the unique process ID to determine whether the currently running process is the one we have assigned to handle the I/O task.
 
 ```julia
 # After: specifying only process with ID == 0 to perform I/O operations
@@ -89,13 +93,13 @@ In a distributed setup, on the other hand, all MPI processes would have displaye
 ## Data Gathering for Visualisation
 
-Now we want to visualize the field `C` as we did in the previous tutorial on a single machine. But previously we split up `C` across various MPI processes. Each process handles a portion of the computation, leading to the necessity of data gathering for visualisation. Let us define a global array `C_v` that should gather all data from other MPI processes to the MPI process that has the unique process ID equals zero `me==0`.
+Not addressing parallel I/O here, we want to visualise the field `C`. But we have split `C` up across the MPI processes: each process handles only a portion of the computation, so the data must be gathered for visualisation. Let us define a global array `C_v` on our designated master MPI process, the one with unique process ID zero (`me==0`), which gathers the data from all other MPI processes for later visualisation. Note that the global size of `C_v` is `size(interior(C)) .* dims(topo)`.
 
 ```julia
 C_v = (me==0) ? KernelAbstractions.zeros(CPU(), Float64, size(interior(C)) .* dims(topo)) : nothing
 ```
 
-We use `gather!(arch, C_v, C)` to explicitly perform a data synchronisation and collect local values of `C` that are decomposed into different arrays stored in the memory space of other MPI processes. And similar to the `@printf` example above, only one MPI process does the visualisation.
+We use `gather!(arch, C_v, C)` to explicitly synchronise and collect the local values of `C`, which are decomposed into separate arrays stored in the memory spaces of the different MPI processes. Similar to the `@printf` example above, only our master MPI process performs the visualisation.
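For completeness, below is a hedged sketch of how the gathered field could then be visualised on the master process with CairoMakie; the actual postprocess block of the example may differ, and the figure file name is a placeholder. It assumes `arch`, `C`, `C_v`, `me` and the MPI topology are set up as in the tutorial, and that `gather!`, presumably a collective operation, is called on all ranks.

```julia
using CairoMakie

# Collective gather of the distributed field into C_v on the master rank
# (C_v is `nothing` on all other ranks).
gather!(arch, C_v, C)

# Only the master rank plots and saves the gathered field.
if me == 0
    fig = Figure()
    ax  = Axis(fig[1, 1]; aspect=DataAspect(), title="C (gathered)")
    heatmap!(ax, C_v)
    save("out_diffusion_2d_mpi.png", fig)  # placeholder output file name
end
```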
 
 ```julia
 # Before: local postprocess
@@ -115,7 +119,7 @@ if me == 0
 end
 ```
 
-## Finalise MPI Environment
+## Finalise MPI
 
 At the very end of the program, we need to call `MPI.Finalize()` to clean up the MPI state.
@@ -123,4 +127,7 @@ At the very end of the program, we need to call the
 MPI.Finalize()
 ```
 
-Note that we need not to do any changes for defining or launching kernels, as they are already MPI-compatible and need no further modification. The full code of the tutorial material is available under [diffusion\_2d\_mpi.jl](https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d_mpi.jl).
\ No newline at end of file
+!!! note "MPI finalisation"
+    Running a Julia MPI code on a single process within the REPL (e.g. for development purposes) requires restarting the Julia session once MPI has been finalised, as MPI cannot be initialised a second time within the same process. Simply omitting `MPI.Finalize()` allows for repeated execution of the code.
+
+Note that we do not need to make any changes to how kernels are defined or launched, as they are already MPI-compatible and need no further modification. The full code of the tutorial material is available under [diffusion\_2d\_mpi.jl](https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d_mpi.jl).
diff --git a/examples/diffusion_2d_mpi.jl b/examples/diffusion_2d_mpi.jl
index c6f2003e..25354e5a 100644
--- a/examples/diffusion_2d_mpi.jl
+++ b/examples/diffusion_2d_mpi.jl
@@ -1,7 +1,7 @@
 using Chmy, Chmy.Architectures, Chmy.Grids, Chmy.Fields, Chmy.BoundaryConditions, Chmy.GridOperators, Chmy.KernelLaunch
 using KernelAbstractions
 using Printf
-# using CairoMakie
+using CairoMakie
 # using AMDGPU
 # AMDGPU.allowscalar(false)
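To try the updated example locally (assuming MPI.jl and CairoMakie are available in the active project), one possible way to launch it on several processes is via the `mpiexecjl` wrapper provided by MPI.jl; the process count and script path below are placeholders.

```julia
# One-time setup: install the `mpiexecjl` launcher wrapper shipped with MPI.jl
# (by default it is placed in the first Julia depot's bin directory, typically ~/.julia/bin).
using MPI
MPI.install_mpiexecjl()

# Then, from a shell (placeholder process count and script path):
#   mpiexecjl -n 4 julia --project examples/diffusion_2d_mpi.jl
```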