Issue #15 suggests one way to speed up the loop in get_solposAM is to use OpenMP. However, in my tests, I did not see a performance improvement with OpenMP:
OpenMP is supported out of the box by MSVC on MS Windows and by GCC (on Linux and Darwin, I believe). It only requires a compiler flag, set in setup.py: /openmp for MSVC on Windows and -fopenmp for GCC on Linux and Darwin.
Then in the source, include <omp.h> and add the OpenMP pragma with the shared and private variables, as well as the chunk size and scheduling:
diff --git a/solar_utils/src/solposAM.c b/solar_utils/src/solposAM.c
index 3710703..7b986f1 100644
--- a/solar_utils/src/solposAM.c
+++ b/solar_utils/src/solposAM.c
@@ -4,6 +4,7 @@
#include <math.h>
#include <string.h>
#include <stdio.h>
+#include <omp.h>
// include solpos header
// contains documentation, function and structure prototypes, enumerations and
@@ -111,7 +112,10 @@ DllExport long get_solposAM( float location[3], int datetimes[][6],
int settings[][2], float orientation[][2], float shadowband[][3],
long err_code[])
{
-    for (size_t i=0; i<cnt; i++){
+    int i;
+    int ncores = 4;
+    int chunk = (int)(cnt / ncores);
+    #pragma omp parallel for shared(err_code,datetimes,angles,airmass,settings, \
+        orientation,shadowband) private(i) schedule(static,chunk)
+    for (i=0; i<cnt; i++){
err_code[i] = solposAM( location, datetimes[i], weather, angles[i],
airmass[i], settings[i], orientation[i], shadowband[i] );
}
Note: I didn't see any difference with these variations:
- no schedule: let OpenMP set the schedule automatically
- no chunk size: let OpenMP set the chunk size automatically
- I only used static scheduling, since each calculation is nearly identical, so all chunks/threads should take about the same time
- the shared and private clauses are required: OpenMP can't infer which is which just by inspecting this particular for-loop, so they must be explicit, including declaring the index variable i before the loop
- see this SO answer on when to use SIMD vs. parallel
I confirmed that all processors are being used, but the calculation is so fast that multiprocessing is an inefficient fit. This might be better suited to SIMD or AVX, which is perhaps why numpy or numexpr would be better. There is a new simd directive in OpenMP 4.0, but I'm not sure it is standard in MSVC or GCC yet, and really this is just too much for now.
However, you might think using multiple threads for such simple addition would be overkill. That is why there is vectorization, which is mostly implemented by SIMD instructions.
I mainly added this issue for posterity, since this is really a dead end IMO. Closing for now, but feel free to reopen in the future if anything changes.
PS: I can confirm that MSVC 2017 (or is it 2015?) does not have the simd directive for OpenMP; I think I read somewhere that MSVC only supports OpenMP 2.0, and simd is only available in OpenMP 4.0.