use omp to parallelize get_solposAM #16

Closed

mikofski opened this issue May 22, 2019 · 2 comments
@mikofski (Contributor) commented on May 22, 2019:

Issue #15 suggests that one way to speed up the loop in get_solposAM is to use OpenMP. However, in my tests I did not see any performance improvement with OpenMP enabled:

In [1]: from solar_utils import solposAM, get_solposAM
   ...: location = [35.56836, -119.2022, -8.0]
   ...: weather = [1015.62055, 40.0]
   ...: from datetime import datetime, timedelta
   ...: times = [(datetime(2017,1,1,0,0,0)+timedelta(hours=h)).timetuple()[:6] for h in range(100000)]

In [2]: %timeit get_solposAM(location, times, weather)
199 ms ± 828 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: from solar_utils import solposAM, get_solposAM
   ...: location = [35.56836, -119.2022, -8.0]
   ...: weather = [1015.62055, 40.0]
   ...: from datetime import datetime, timedelta
   ...: times = [(datetime(2017,1,1,0,0,0)+timedelta(hours=h)).timetuple()[:6] for h in range(10000)]

In [4]: %timeit get_solposAM(location, times, weather)
22.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: from solar_utils import solposAM, get_solposAM
   ...: location = [35.56836, -119.2022, -8.0]
   ...: weather = [1015.62055, 40.0]
   ...: from datetime import datetime, timedelta
   ...: times = [(datetime(2017,1,1,0,0,0)+timedelta(hours=h)).timetuple()[:6] for h in range(1000)]

In [6]: %timeit get_solposAM(location, times, weather)
2.13 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that the timings scale linearly, at roughly 2 µs per timestamp, across all three sizes. OpenMP is supported out of the box by both MSVC on MS Windows and GCC on Linux (and presumably Darwin). It only requires a compiler flag: /openmp for MSVC on Windows, and -fopenmp for GCC, which needs the flag at both compile and link time (hence adding it to both CCFLAGS and LDFLAGS below).
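
Before touching setup.py, a minimal self-contained check (a hypothetical file, not part of this repo) can confirm that each toolchain actually enables OpenMP with these flags:

/* omp_check.c -- hypothetical sanity check, not part of this repo.
 * Build: cl /openmp omp_check.c     (MSVC on Windows)
 *        gcc -fopenmp omp_check.c   (GCC on Linux/Darwin)
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* each thread reports its id; output order is nondeterministic */
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

If OpenMP is enabled correctly, this prints one line per hardware thread.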

In setup.py:

diff --git a/setup.py b/setup.py
index 8507a2e..8ea66f0 100644
--- a/setup.py
+++ b/setup.py
@@ -36,18 +36,19 @@ PLATFORM = sys.platform
 if PLATFORM == 'win32':
     LIB_FILE = '%s.dll'
     MACROS = [('WIN32', None)]
+    CCFLAGS = ['/openmp']
     if PYVERSION.major >= 3 and PYVERSION.minor >= 5:
         LDFLAGS = ['/DLL']
 elif PLATFORM == 'darwin':
     LIB_FILE = 'lib%s.dylib'
     RPATH = "-Wl,-rpath,@loader_path/"
     INSTALL_NAME = "@rpath/" + LIB_FILE
-    CCFLAGS = LDFLAGS = ['-fPIC']
+    CCFLAGS = LDFLAGS = ['-fPIC', '-fopenmp']
 elif PLATFORM in ['linux', 'linux2']:
     PLATFORM = 'linux'
     LIB_FILE = 'lib%s.so'
     RPATH = "-Wl,-rpath,${ORIGIN}"
-    CCFLAGS = LDFLAGS = ['-fPIC']
+    CCFLAGS = LDFLAGS = ['-fPIC', '-fopenmp']
 else:
     sys.exit('Platform "%s" is unknown or unsupported.' % PLATFORM)

Then, in the source, include <omp.h> and add the OpenMP pragma to the loop, specifying the shared and private variables as well as the chunk size and schedule:

diff --git a/solar_utils/src/solposAM.c b/solar_utils/src/solposAM.c
index 3710703..7b986f1 100644
--- a/solar_utils/src/solposAM.c
+++ b/solar_utils/src/solposAM.c
@@ -4,6 +4,7 @@
 #include <math.h>
 #include <string.h>
 #include <stdio.h>
+#include <omp.h>

 // include solpos header
 // contains documentation, function and structure prototypes, enumerations and
@@ -111,7 +112,12 @@ DllExport long get_solposAM( float location[3], int datetimes[][6],
     int settings[][2], float orientation[][2], float shadowband[][3],
     long err_code[])
 {
-    for (size_t i=0; i<cnt; i++){
+    int i;
+    int ncores = 4;
+    int chunk = (int)(cnt / ncores);
+    #pragma omp parallel for shared(err_code,datetimes,angles,airmass,settings, \
+            orientation,shadowband) private(i) schedule(static,chunk)
+    for (i=0; i<cnt; i++){
         err_code[i] = solposAM( location, datetimes[i], weather, angles[i],
             airmass[i], settings[i], orientation[i], shadowband[i] );
     }
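
Two details in this patch are worth calling out. The loop index changes from size_t to int because MSVC implements OpenMP 2.0, which requires a signed integer loop variable in a parallel for. Also, chunk = cnt / ncores evaluates to 0 whenever cnt < ncores, and schedule(static,chunk) requires a positive chunk size, so in general the chunk would need a guard. Below is a minimal self-contained sketch of the same pattern, with a hypothetical do_work() standing in for the solposAM() call; it also compiles unchanged without OpenMP, since compilers ignore pragmas they don't recognize:

/* parallel_for_sketch.c -- hypothetical sketch of the pattern above;
 * do_work() stands in for the solposAM() call.
 * Build: gcc -fopenmp parallel_for_sketch.c   (or cl /openmp on Windows)
 */
#include <stdio.h>

#define CNT 100000

static long err_code[CNT];  /* file scope keeps the array off the stack */

static long do_work(int i)
{
    return (long)i % 7;  /* placeholder for the real computation */
}

int main(void)
{
    int i;  /* MSVC's OpenMP 2.0 requires a signed loop index */
    #pragma omp parallel for shared(err_code) private(i) schedule(static)
    for (i = 0; i < CNT; i++) {
        err_code[i] = do_work(i);
    }
    printf("err_code[%d] = %ld\n", CNT - 1, err_code[CNT - 1]);
    return 0;
}

With schedule(static) and no chunk size, the iterations are divided into approximately equal contiguous chunks, one per thread, which is effectively what the explicit cnt / ncores arithmetic was trying to do.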

Note: I didn't see any difference with these variations:

  • no schedule clause, letting OpenMP choose the schedule automatically
  • no chunk size, letting OpenMP choose the chunk size automatically
  • I only tried static scheduling, since each iteration does nearly identical work, so all chunks/threads should take about the same time
  • the shared and private clauses are required; OpenMP can't infer which variables are which just by inspecting this particular for-loop, so they must be explicit, and the index variable i must be declared before the loop
  • I confirmed that all processors are being used, but the calculation is SO fast that it is an inefficient use of multithreading. This might be better suited to SIMD or AVX, which is perhaps why numpy or numexpr would be better. There is a new simd directive in OpenMP 4.0 (see the sketch after this list), but I'm not sure it is standard in MSVC or GCC yet, and really this is just too much for now.
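
For reference, here is a minimal sketch of the OpenMP 4.0 simd directive mentioned above (a hypothetical example, not from this repo; it requires a compiler with OpenMP 4.0 support, e.g. GCC >= 4.9). It asks the compiler to vectorize the loop with SIMD instructions rather than spawn threads:

/* simd_sketch.c -- hypothetical example of the OpenMP 4.0 simd directive.
 * Build: gcc -fopenmp simd_sketch.c   (GCC >= 4.9)
 */
#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], b[N], c[N];
    int i;
    for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    /* vectorize with SIMD instructions instead of creating threads */
    #pragma omp simd
    for (i = 0; i < N; i++) {
        c[i] = a[i] * b[i];
    }
    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}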

See this SO answer on when to use SIMD vs. parallel:

    "However, you might think using multiple threads for such simple addition would be overkill. That is why there is vectorization, which is mostly implemented by SIMD instructions."

@mikofski (Contributor, Author) commented:

I mainly added this issue for posterity, since this is really a dead end IMO. Closing for now, but feel free to reopen in the future if anything changes.

@mikofski (Contributor, Author) commented:

PS: I can confirm that MSVC 2017 (or is it 2015?) does not have the simd directive for OpenMP. I believe MSVC only implements OpenMP 2.0, and simd was not added until OpenMP 4.0.
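
A quick way to check which OpenMP version a compiler implements is the _OPENMP macro, which expands to the release date (yyyymm) of the supported spec: 200203 for 2.0 and 201307 for 4.0. A minimal sketch:

/* omp_version.c -- prints the OpenMP spec date the compiler implements.
 * Build: cl /openmp omp_version.c   or   gcc -fopenmp omp_version.c
 */
#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    printf("_OPENMP = %d\n", _OPENMP);  /* MSVC prints 200203, i.e. 2.0 */
#else
    printf("OpenMP is not enabled\n");
#endif
    return 0;
}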
