use omp to parallelize get_solposAM #16

Closed

mikofski opened this issue May 22, 2019 · 2 comments
@mikofski (Contributor) commented on May 22, 2019:

Issue #15 suggests that one way to speed up the loop in get_solposAM is to use OpenMP. However, in my tests I did not see any performance improvement with OpenMP enabled:

In [1]: from solar_utils import solposAM, get_solposAM
   ...: location = [35.56836, -119.2022, -8.0]
   ...: weather = [1015.62055, 40.0]
   ...: from datetime import datetime, timedelta
   ...: times = [(datetime(2017,1,1,0,0,0)+timedelta(hours=h)).timetuple()[:6] for h in range(100000)]

In [2]: %timeit get_solposAM(location, times, weather)
199 ms ± 828 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: from solar_utils import solposAM, get_solposAM
   ...: location = [35.56836, -119.2022, -8.0]
   ...: weather = [1015.62055, 40.0]
   ...: from datetime import datetime, timedelta
   ...: times = [(datetime(2017,1,1,0,0,0)+timedelta(hours=h)).timetuple()[:6] for h in range(10000)]

In [4]: %timeit get_solposAM(location, times, weather)
22.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: from solar_utils import solposAM, get_solposAM
   ...: location = [35.56836, -119.2022, -8.0]
   ...: weather = [1015.62055, 40.0]
   ...: from datetime import datetime, timedelta
   ...: times = [(datetime(2017,1,1,0,0,0)+timedelta(hours=h)).timetuple()[:6] for h in range(1000)]

In [6]: %timeit get_solposAM(location, times, weather)
2.13 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that the timings scale linearly, at roughly 2 µs per timestamp, across all three sizes. OpenMP is supported out of the box by both MSVC on MS Windows and GCC on Linux (and presumably Darwin). It only requires a compiler flag: /openmp for MSVC on Windows, and -fopenmp for GCC, which needs the flag at both compile and link time (hence adding it to both CCFLAGS and LDFLAGS below).
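
Before touching setup.py, a minimal self-contained check (a hypothetical file, not part of this repo) can confirm that each toolchain actually enables OpenMP with these flags:

/* omp_check.c -- hypothetical sanity check, not part of this repo.
 * Build: cl /openmp omp_check.c     (MSVC on Windows)
 *        gcc -fopenmp omp_check.c   (GCC on Linux/Darwin)
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* each thread reports its id; output order is nondeterministic */
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

If OpenMP is enabled correctly, this prints one line per hardware thread.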

In setup.py:

diff --git a/setup.py b/setup.py
index 8507a2e..8ea66f0 100644
--- a/setup.py
+++ b/setup.py
@@ -36,18 +36,19 @@ PLATFORM = sys.platform
 if PLATFORM == 'win32':
     LIB_FILE = '%s.dll'
     MACROS = [('WIN32', None)]
+    CCFLAGS = ['/openmp']
     if PYVERSION.major >= 3 and PYVERSION.minor >= 5:
         LDFLAGS = ['/DLL']
 elif PLATFORM == 'darwin':
     LIB_FILE = 'lib%s.dylib'
     RPATH = "-Wl,-rpath,@loader_path/"
     INSTALL_NAME = "@rpath/" + LIB_FILE
-    CCFLAGS = LDFLAGS = ['-fPIC']
+    CCFLAGS = LDFLAGS = ['-fPIC', '-fopenmp']
 elif PLATFORM in ['linux', 'linux2']:
     PLATFORM = 'linux'
     LIB_FILE = 'lib%s.so'
     RPATH = "-Wl,-rpath,${ORIGIN}"
-    CCFLAGS = LDFLAGS = ['-fPIC']
+    CCFLAGS = LDFLAGS = ['-fPIC', '-fopenmp']
 else:
     sys.exit('Platform "%s" is unknown or unsupported.' % PLATFORM)

Then, in the source, include <omp.h> and add the OpenMP pragma to the loop, specifying the shared and private variables as well as the chunk size and schedule:

diff --git a/solar_utils/src/solposAM.c b/solar_utils/src/solposAM.c
index 3710703..7b986f1 100644
--- a/solar_utils/src/solposAM.c
+++ b/solar_utils/src/solposAM.c
@@ -4,6 +4,7 @@
 #include <math.h>
 #include <string.h>
 #include <stdio.h>
+#include <omp.h>

 // include solpos header
 // contains documentation, function and structure prototypes, enumerations and
@@ -111,7 +112,12 @@ DllExport long get_solposAM( float location[3], int datetimes[][6],
     int settings[][2], float orientation[][2], float shadowband[][3],
     long err_code[])
 {
-    for (size_t i=0; i<cnt; i++){
+    int i;
+    int ncores = 4;
+    int chunk = (int)(cnt / ncores);
+    #pragma omp parallel for shared(err_code,datetimes,angles,airmass,settings, \
+            orientation,shadowband) private(i) schedule(static,chunk)
+    for (i=0; i<cnt; i++){
         err_code[i] = solposAM( location, datetimes[i], weather, angles[i],
             airmass[i], settings[i], orientation[i], shadowband[i] );
     }
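
Two details in this patch are worth calling out. The loop index changes from size_t to int because MSVC implements OpenMP 2.0, which requires a signed integer loop variable in a parallel for. Also, chunk = cnt / ncores evaluates to 0 whenever cnt < ncores, and schedule(static,chunk) requires a positive chunk size, so in general the chunk would need a guard. Below is a minimal self-contained sketch of the same pattern, with a hypothetical do_work() standing in for the solposAM() call; it also compiles unchanged without OpenMP, since compilers ignore pragmas they don't recognize:

/* parallel_for_sketch.c -- hypothetical sketch of the pattern above;
 * do_work() stands in for the solposAM() call.
 * Build: gcc -fopenmp parallel_for_sketch.c   (or cl /openmp on Windows)
 */
#include <stdio.h>

#define CNT 100000

static long err_code[CNT];  /* file scope keeps the array off the stack */

static long do_work(int i)
{
    return (long)i % 7;  /* placeholder for the real computation */
}

int main(void)
{
    int i;  /* MSVC's OpenMP 2.0 requires a signed loop index */
    #pragma omp parallel for shared(err_code) private(i) schedule(static)
    for (i = 0; i < CNT; i++) {
        err_code[i] = do_work(i);
    }
    printf("err_code[%d] = %ld\n", CNT - 1, err_code[CNT - 1]);
    return 0;
}

With schedule(static) and no chunk size, the iterations are divided into approximately equal contiguous chunks, one per thread, which is effectively what the explicit cnt / ncores arithmetic was trying to do.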

Note: I didn't see any difference with these variations:

  • no schedule clause, letting OpenMP choose the schedule automatically
  • no chunk size, letting OpenMP choose the chunk size automatically
  • I only tried static scheduling, since each iteration does nearly identical work, so all chunks/threads should take about the same time
  • the shared and private clauses are required; OpenMP can't infer which variables are which just by inspecting this particular for-loop, so they must be explicit, and the index variable i must be declared before the loop
  • I confirmed that all processors are being used, but the calculation is SO fast that it is an inefficient use of multithreading. This might be better suited to SIMD or AVX, which is perhaps why numpy or numexpr would be better. There is a new simd directive in OpenMP 4.0 (see the sketch after this list), but I'm not sure it is standard in MSVC or GCC yet, and really this is just too much for now.
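
For reference, here is a minimal sketch of the OpenMP 4.0 simd directive mentioned above (a hypothetical example, not from this repo; it requires a compiler with OpenMP 4.0 support, e.g. GCC >= 4.9). It asks the compiler to vectorize the loop with SIMD instructions rather than spawn threads:

/* simd_sketch.c -- hypothetical example of the OpenMP 4.0 simd directive.
 * Build: gcc -fopenmp simd_sketch.c   (GCC >= 4.9)
 */
#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], b[N], c[N];
    int i;
    for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    /* vectorize with SIMD instructions instead of creating threads */
    #pragma omp simd
    for (i = 0; i < N; i++) {
        c[i] = a[i] * b[i];
    }
    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}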

See this SO answer on when to use SIMD vs. parallel:

    "However, you might think using multiple threads for such simple addition would be overkill. That is why there is vectorization, which is mostly implemented by SIMD instructions."

@mikofski (Contributor, Author) commented:

I mainly added this issue for posterity, since this is really a dead end IMO. Closing for now, but feel free to reopen in the future if anything changes.

@mikofski (Contributor, Author) commented:

PS: I can confirm that MSVC 2017 (or is it 2015?) does not have the simd directive for OpenMP. I believe MSVC only implements OpenMP 2.0, and simd was not added until OpenMP 4.0.
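
A quick way to check which OpenMP version a compiler implements is the _OPENMP macro, which expands to the release date (yyyymm) of the supported spec: 200203 for 2.0 and 201307 for 4.0. A minimal sketch:

/* omp_version.c -- prints the OpenMP spec date the compiler implements.
 * Build: cl /openmp omp_version.c   or   gcc -fopenmp omp_version.c
 */
#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    printf("_OPENMP = %d\n", _OPENMP);  /* MSVC prints 200203, i.e. 2.0 */
#else
    printf("OpenMP is not enabled\n");
#endif
    return 0;
}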
