Run "mba" in OpenCL #126
I don't remember all the implementation details right away, but it seems it would be possible to do the setup phase in OpenCL. It could even make sense because all underlying structures are regular. It will, however, take some time, because I am a bit busy at the moment. |
Denis, the setup was done using a lot of C++11 code :) That's why I was having a hard time understanding it, but I have a suspicion that it can be done. Let me know if I can help in any way. |
I think it should be enough to implement the control lattice structure with OpenCL. See the referenced paper for the details of the algorithm. The biggest problem is that it needs to be done in a generic way w.r.t. the number of dimensions, so one would need to do some OpenCL code generation.
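Just to illustrate what I mean by code generation (a very rough sketch, not VexCL code; the kernel name, argument list, and stubbed body are hypothetical): the host builds the kernel source as a string parameterized by the number of dimensions and compiles it at run time.
#include <sstream>
#include <string>

// Emit an OpenCL kernel that takes one coordinate array per dimension plus the
// data values; the actual body of the setup step is left as a stub.
std::string make_setup_kernel_source(int ndim) {
    std::ostringstream src;
    src << "kernel void setup_lattice(\n";
    for (int d = 0; d < ndim; ++d)
        src << "    global const double *coord" << d << ",\n";
    src << "    global const double *val,\n"
        << "    global double *delta,\n"
        << "    global double *omega,\n"
        << "    ulong npoints\n"
        << ")\n{\n"
        << "    size_t idx = get_global_id(0);\n"
        << "    if (idx >= npoints) return;\n"
        << "    /* compute the 4^" << ndim << " tensor-product B-spline weights\n"
        << "       for point idx and scatter them into delta and omega */\n"
        << "}\n";
    return src.str();
}
|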
Why not take the FFT route? Handcraft those kernels for 1, 2, and 3D and make them work on the device for these three dimensions. For all other dimensions, could it continue to use the current host option? |
I don't like the idea of keeping separate (but very similar) kernels when they all may be generated from a single source. |
If you are referring to the basic BA algorithm in the paper, then there are loops that can easily be "OpenMP-fied" even in the current implementation. |
Yes. Feel free to provide a pull request :). |
So I looked at the mba implementation a bit closer. Now I know why I decided to stay on the CPU for the initialization. First, and least important, VexCL supports parallel work with multiple compute devices. Since MBA may take a random set of coordinates to get interpolated values at, each device has to hold a complete copy of the control lattice. Second, in the initialization loop over the data points the temporary arrays (delta and omega) are updated from different data points, so those updates have to be synchronized once the loop is parallelized. For example, take a look at c217b22. Comparing the runs without and with OpenMP, setup now takes 11 seconds instead of 2.3. So unless I did something wrong here, it seems it's better to leave the current MBA setup as it is. |
Ok, I did something wrong here. After replacing the critical section with an atomic update in ecad92c, the result is better than the serial version, but only slightly. This also introduces the requirement that the iterators to the coordinates and values of the data points point to contiguous chunks of memory. I am not sure the negligible speedup is worth it. What do you think?
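For reference, here is a minimal, self-contained sketch of the atomic-based scatter for the 2D case (this is not the code from ecad92c; the function name, the array layout, and the assumption that u and v are already mapped into [0, m) x [0, n) are mine):
#include <cmath>
#include <vector>

// Standard uniform cubic B-spline basis functions.
static double B0(double s) { double r = 1 - s; return r * r * r / 6; }
static double B1(double s) { return (3 * s * s * s - 6 * s * s + 4) / 6; }
static double B2(double s) { return (-3 * s * s * s + 3 * s * s + 3 * s + 1) / 6; }
static double B3(double s) { return s * s * s / 6; }

// Scatter phase of the BA algorithm: each data point (u, v, z), with u in [0, m)
// and v in [0, n), updates a 4x4 patch of the (m+3) x (n+3) control lattice.
// Several points may hit the same lattice entry, hence the atomic updates.
void ba_scatter(const std::vector<double> &u, const std::vector<double> &v,
                const std::vector<double> &z, int m, int n,
                std::vector<double> &delta, std::vector<double> &omega)
{
    const int stride = n + 3; // delta and omega hold (m+3) * (n+3) entries, row-major
    const int np = static_cast<int>(u.size());

    #pragma omp parallel for
    for (int ip = 0; ip < np; ++ip) {
        const int    i = static_cast<int>(u[ip]) - 1;
        const int    j = static_cast<int>(v[ip]) - 1;
        const double s = u[ip] - std::floor(u[ip]);
        const double t = v[ip] - std::floor(v[ip]);

        const double Bs[4] = { B0(s), B1(s), B2(s), B3(s) };
        const double Bt[4] = { B0(t), B1(t), B2(t), B3(t) };

        double w[4][4], sum_w2 = 0;
        for (int k = 0; k < 4; ++k)
            for (int l = 0; l < 4; ++l) {
                w[k][l] = Bs[k] * Bt[l];
                sum_w2 += w[k][l] * w[k][l];
            }

        for (int k = 0; k < 4; ++k)
            for (int l = 0; l < 4; ++l) {
                const double phi = w[k][l] * z[ip] / sum_w2; // equation (3) of the paper
                const double w2  = w[k][l] * w[k][l];
                const int    idx = (i + k + 1) * stride + (j + l + 1); // +1: lattice index -1 maps to array row 0
                #pragma omp atomic
                delta[idx] += w2 * phi;
                #pragma omp atomic
                omega[idx] += w2;
            }
    }
}
|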
If I look at the BA algorithm in the paper, there are three main loops. |
I have another question regarding mba_benchmark.cpp:
Now, is this the only way to copy x and y to the device? Would it be possible to allocate values to C(0) and C(1) directly without having to first allocate them on the host? |
There are no host-allocated structures in the snippet you provided. Both vectors already live on the device. If what you meant to ask is whether it is possible to initialize a newly created device vector with host data, then the answer is yes, it is possible. See the list of vex::vector constructors. The mba_benchmark does look a bit ugly in this regard; 56e236a fixes that.
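For instance, a minimal sketch (assuming an initialized vex::Context named ctx and a problem size n):
std::vector<double> host_x(n, 1.0);  // data prepared on the host
vex::vector<double> x(ctx, host_x);  // device vector initialized from the host data
vex::vector<double> y(ctx, n);       // device vector with no host counterpart
y = 2 * x;                           // evaluated on the compute device
vex::copy(y, host_x);                // read the result back to the host when needed
|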
Regarding the MBA algorithm, if you have a closer look at |
Yes, indeed that is the case of (for each point (x_c, y_c, z_c) in P do), and that is what I meant by "data parallelism". If I were to use a CPU as an example, I would divide the data into k parts (where k is the number of cores) and run BA on the control points within each part, except for the boundary points. This is a classic case of parallelizing region-merging algorithms using the "union-find" method. If you do it with atomics, indeed it will kill the performance. Now, for your previous comment: in my case I need to fill the coordinate vectors on the device, where mSortedBands contains the index values. Any pointers on how one can do this in VexCL? Maybe this is the right time to write my first OpenCL (or should I say, VexCL) kernel :) Also, why are you not freeing the x, y vectors in mba_benchmark? How do the device vectors get deallocated? Are they smart pointers? |
Ok, the question about data allocation is a lot clearer now. You could do this:
vex::vector<float> x(ctx, n), y(ctx, n), z(ctx, n);
auto i = vex::element_index();
x = i % inSize[0];
y = (i / inSize[0]) % inSize[1];
Not sure about the last one, but something along these lines should work:
z = vex::permutation(i / (inSize[0] * inSize[1]))(mSortedBands);
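Here vex::element_index() yields the linear index 0, 1, 2, ... of each element, and the modulo/division expressions decode the per-dimension coordinates from it, so the coordinate vectors are filled entirely on the device without any host-side staging arrays.
|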
Regarding a K-way split of the input data, how would you do it on a CPU? Would each core skip points that do not belong to its subdomain? Or would you do a sort-by-key first, where the key is the subdomain each point belongs to? It seems that on a GPU only the second of these options would make sense, but then it has worse algorithmic complexity than the original operation. |
Perfect, then the only thing that would be of interest in this implementation would be the aspect of data parallelization. To understand this concept, take a look at Fig 3 in this paper: |
Regarding the deallocation: vex::vector is an RAII class; its device memory is released automatically in the destructor.
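In other words, a trivial sketch (ctx and n as in the benchmark):
{
    vex::vector<double> tmp(ctx, n); // device memory is allocated here
    // ... use tmp in expressions ...
}   // tmp goes out of scope here; its device memory is released automatically
|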
So in the mba_benchmark do I explicitly have to state p.clear(), v.clear() and so on...? |
No, you just let them go out of scope. No memory will leak. |
The technique described in the paper by Harrison et al. (and domain decomposition in general) is suitable for fat cluster nodes or CPU cores. This is an example of coarse-grain parallelism. GPUs, on the other hand, have fine-grained parallelism, where each thread is assigned to a single data point (e.g. a matrix or vector element). So I don't think this approach could be used here. |
Thanks for the terms :) I always wondered why we cannot use a GPU for coarse-grained parallelism. However, the parallelism I am hinting at can (??) be achieved by the process of interleaving. As indicated in Sec 3.2 |
Could you provide a working OpenMP prototype for your idea? Just for the main loop over data points on page 4, with some random input. |
Denis, check the functions:
void MBA::BAalg()
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)
inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2)
I have extracted all the relevant code pertaining to the BA algorithm only. We are not assuming UNIFORM_CUBIC_C1_SPLINES.
for (int ip = 0; ip < noPoints; ip++)
{
// Map to the half open domain Omega = [0,m) x [0,n)
// The mapped uc and vc must be (strictly) less than m and n respectively
double uc = (data_.U(ip) - data_.umin()) * interval_normalization_factor_u;
double vc = (data_.V(ip) - data_.vmin()) * interval_normalization_factor_v;
int i, j;
double s, t;
UCBspl::ijst(m_, n_, uc, vc, i, j, s, t);
// compute w_kl's and SumSum w_ab^2 here:
double w_kl[4][4];
int k,l;
double sum_w_ab2_inv = 0.0;
UCBspl::WKLandSum2(s, t, w_kl, sum_w_ab2_inv);
sum_w_ab2_inv = double(1) / sum_w_ab2_inv;
double zc = data_.Z()[ip];
// check p. 231: k=(i+1) - flor(xc) and l = ...
for (k = 0; k <= 3; k++)
{
for (l = 0; l <=3; l++)
{
// compute phi_kl with equation (3)
double tmp = w_kl[k][l];
// 1. Originally
double phi_kl = tmp * zc * sum_w_ab2_inv;
// 2. Alternatively, to let it tapper of more smoothly (but more efficient if permantly)
//double t = 0.8; double phi_kl = (1.0-t)*tmp*zc/sum_w_ab2 + t*zc;
// 3. Alternatively, with linear term
// double alpha = 0.01; double phi_kl = (tmp*zc + alpha*(tmp - sum_w_ab2)) / sum_w_ab2;
// And alternatively for equation (5):
// from |w_kl|^2 to |w_kl| to get a weighted average
// just skip the next statement
tmp *= tmp;
delta_(i+k,j+l) += tmp*phi_kl;
omega_(i+k,j+l) += tmp;
}
}
}
// s,t \in [0,1) (but special on gridlines m and n)
// i,j \in [-1, ???
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)
{
//int i = std::min((int)uc - 1, m-2);
//int j = std::min((int)vc - 1, n-2);
#ifdef UNIFORM_CUBIC_C1_SPLINES
i = 2*((int)uc) - 1;
j = 2*((int)vc) - 1;
#else
i = (int)uc - 1;
j = (int)vc - 1;
#endif
s = uc - floor(uc);
t = vc - floor(vc);
// adjust for x or y on gridlines m and n (since impl. has 0 <= x <= m and 0 <= y <= n
#ifdef UNIFORM_CUBIC_C1_SPLINES
if (i == 2*m-1) {
i-=2;
s = 1;
}
if (j == 2*n-1) {
j-=2;
t = 1;
}
#else
if (i == m-1) {
i--;
s = 1;
}
if (j == n-1) {
j--;
t = 1;
}
#endif
}
inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2)
{
sum_w_ab2 = 0.0;
double Bs0 = B_0(s); double Bt0 = B_0(t);
double Bs1 = B_1(s); double Bt1 = B_1(t);
double Bs2 = B_2(s); double Bt2 = B_2(t);
double Bs3 = B_3(s); double Bt3 = B_3(t);
double tmp;
// unrolled by Odd Andersen 15. dec. 2003, for optimization
tmp = Bs0 * Bt0; w_kl[0][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt1; w_kl[0][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt2; w_kl[0][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt3; w_kl[0][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt0; w_kl[1][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt1; w_kl[1][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt2; w_kl[1][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt3; w_kl[1][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt0; w_kl[2][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt1; w_kl[2][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt2; w_kl[2][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt3; w_kl[2][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt0; w_kl[3][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt1; w_kl[3][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt2; w_kl[3][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt3; w_kl[3][3] = tmp; sum_w_ab2 += tmp * tmp;
// int k,l;
// sum_w_ab2 = 0.0;
// for (k = 0; k <= 3; k++) {
// for (l = 0; l <=3; l++) {
// double tmp = w(k, l, s, t);
// w_kl[k][l] = tmp;
// sum_w_ab2 += (tmp*tmp);
// }
// }
} |
I was not saying that you are wrong, I merely asked for a working prototype of your idea. Cheers, |
I will take a crack at it. I would not, however, be able to translate it to a VexCL format, but I think I can make it using OpenMP. |
An OpenMP implementation should be enough. Also, you don't need to implement the full BA algorithm. A parallelization of the loop on p. 4 is enough. |
Meanwhile, let me try this with OpenMP. |
Here is the timing when using the current version of mba_benchmark:
surf(0.5, 0.5, 0.5) = -0.000214423
Profile: 1105.916 sec.
The data size = 16 * 2048 * 1024 |
So it would be interesting to see if the setup time can be reduced. I am working on the OpenMP implementation, but it will be different from your C++11-based stuff. |
But that's the timing for VexCL's benchmark. I was more interested in timings for your own problem (I was secretly hoping that the setup takes a negligible fraction of time for your use case). |
My problem is a little bit more involved. First of all, a couple of questions: |
The grid has non-unit spacing, so some scaling is done before computing |
But suppose my grid coordinates are indeed integers; let's say, in my case, [0..15] x [0..2047] x [0..1023]. Then s[d], where d = 0, 1, ..., NDIM-1, would be [0, 1/16, 2/16, ..., 15/16] x [0, 1/2048, 2/2048, ..., 2047/2048] x [0, 1/1024, 2/1024, ..., 1023/1024]. |
And this is what I am not understanding. So, if my grid is uniform, how could one build those numbers for s[d]? I think that is the only problem I am facing now. |
Look again at the lines I quoted before. The first line scales the point coordinates to the domain of the control lattice. In case the grid has integer coordinates (note that this won't be the case on subsequent hierarchy levels anyway), the scaling works just the same.
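As a rough illustration (a hypothetical helper, following the BAalg code quoted earlier in this thread): a coordinate x in [xmin, xmax] is first scaled to the lattice domain, and only then split into a cell index and a local parameter, which is the s[d] in question.
#include <cmath>

// x is a data coordinate, [xmin, xmax] is the data range, m is the lattice size
// in that dimension. Integer grid coordinates are handled exactly the same way.
inline void map_to_lattice(double x, double xmin, double xmax, int m,
                           int &i, double &s)
{
    double u = (x - xmin) * m / (xmax - xmin); // scale to [0, m]
    i = static_cast<int>(u) - 1;               // index of the affected lattice cell
    s = u - std::floor(u);                     // local coordinate in [0, 1)
}
|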
Yes, I did see those lines. The problem is that the grid you provide in mba_benchmark is also in [0, 1] (when you call the grid creation module). So, is it mandatory for me to normalize the grid values to [0, 1]? Which is why I asked you: should I normalize my actual grid coordinates to lie in [0, 1] to conform with the mba architecture, or can it work with integer grid locations? |
The grid in the benchmark is just an example. |
One more question here: in your mba_benchmark, you assume that all points in the grid need to be interpolated, whereby all points in the grid form part of the lattice. Is this the mandatory way of implementing this? I think the number of control points should (generally) be much less than the actual number of points on the grid. |
There are no restrictions on the structure or number of control points or interpolation points. The benchmark is just a benchmark; it is there just to do a stress test of the algorithm. |
Note to myself: each control point updates |
I am facing a problem running vex::mba in parallel threads. The crash happens when I am doing Z = surf(C(0), C(1)). This means that I cannot run concurrent interpolations on the GPU from multiple host threads. In that case, if I were to serialize this, I would need to save the spline coefficients. However, the API that is provided for MBA does not allow me to do this. Is there a way out of this conundrum? |
Running interpolation on the same GPU from multiple threads does not make sense performance-wise, since OpenCL will serialize the kernels anyway. Moreover, it is not safe (see the notes about thread safety in the OpenCL specification), so you should guard these calls with a mutex. Applying an instance of MBA on a GPU other than the one it was constructed for is not supported. Regarding the incorrect results in 3D: I have mostly used mba for 3D problems, and it worked for me. Could you please provide a simple test case?
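For the guarding, something along these lines should be enough (a sketch; Z, surf, and C are the names from your snippet, the mutex is added here):
#include <mutex>

static std::mutex device_mutex; // one mutex per compute device

void interpolate_slice(/* ... */)
{
    std::lock_guard<std::mutex> lock(device_mutex);
    Z = surf(C(0), C(1)); // only one host thread issues device work at a time
}
|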
So, that's why I said I will serialize the interpolation operation. However, as the MBA algorithm runs on the CPU, I can potentially parallelize it. For that, I would need to create a container of vex::mba instances (one per slice). I will provide you with a set of 3D control points as well as test points. |
Only the constructor of MBA runs on the CPU. The interpolation itself runs on the GPU (or, more generally, on a compute device). You could use a vector of smart pointers (e.g. a boost::ptr_vector, or a std::vector of std::shared_ptr) to hold the instances. |
Maybe this should be documented somewhere in the code. |
What exactly? The use of smart pointers with a class that does not provide a default constructor? |
Yes, especially for this example. I am familiar with std::shared_ptr. I will have to look into boost::ptr_vector and see if that is more beneficial. |
I tried it with std::shared_ptr and it apparently worked. However, I think the vex::mba is being overwritten:
// This defines a context to be used by VexCL
static vex::Context ctx( vex::Filter::Env );
// A typedef for the routine to be called from vexcl
typedef vex::mba<2, double> mba_type;
// A vector of such mba modules
static std::vector< std::shared_ptr< mba_type > > mba_vector(cBands_Num, NULL);
and the calling loops are:
//#ifdef USE_OPENMP
//#pragma omp parallel for
//#endif
for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
{
....//The image to be interpolated pImage
// 4.) Run the MBA pipeline on this slice
MBA_BuildSplineLattice(pImage, h_Rows, w_Cols, sIdx);
}
// In this loop we use the spline lattices to interpolate missing pixels. This
// step is done serially as the interpolation is effected on the GPU.
for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
{
std::cout << "Interpolate: "<<cBands[sIdx] << std::endl;
// 5.) Run the MBA pipeline on this slice
MBA_Interpolate(pImage, h_Rows, w_Cols, sIdx);
.....//The interpolated image
}
and the place where I build the spline lattice:
....
//////////////////////////////////////////////////////////////////////////////
//From this point, we interface it with vexcl/mba.hpp
mba_vector[sIdx] = std::make_shared<mba_type>(ctx,
make_array2<double>(-0.01, -0.01),
make_array2<double>(1.01, 1.01),
p, v, make_array2<size_t>(2, 2));
I can compute the spline lattice for all slices in parallel and then do the interpolation serially. |
This looks mostly valid. Why do you make the context and the vector static? Also, what do you mean by 'vex::mba is being overwritten'? |
Static is because I do not want to get errors related to "multiple definitions". Now, I did it the other way, as follows:
for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
{
....//The image to be interpolated pImage
// 4.) Run the MBA pipeline on this slice
MBA_BuildSplineLattice(pImage, h_Rows, w_Cols, sIdx);
//}
// In this loop we use the spline lattices to interpolate missing pixels. This
// step is done serially as the interpolation is effected on the GPU.
// for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
//{
std::cout << "Interpolate: "<<cBands[sIdx] << std::endl;
// 5.) Run the MBA pipeline on this slice
MBA_Interpolate(pImage, h_Rows, w_Cols, sIdx);
.....//The interpolated image
}
and it works like a breeze! Can you see if you can reproduce this at your end? I think there is some issue with storing previously generated lattices. Yes, all the other factors that you have mentioned are stable. |
Ignore these comments... My mistake: I was not updating the slice to be interpolated, and it stayed on the last slice after the build-lattice loop :) |
Could you write a bit more clearly what does and what does not work? |
The variable "pImage" in the code above corresponds to an image slice. I was not updating that in the interpolation phase. Now, would / can there be a way to run interpolation for multiple spline lattices and multiple C(0) and C(1) rather than having to do it the serial way? |
As I said before, applying several instances of a vex::mba concurrently on the same compute device will not give you any performance benefit, since the kernels are serialized by OpenCL anyway. |
Currently in mba_benchmark.cpp, we see that the spline creation module runs on the CPU while the interpolation is done on the GPU. This makes sense. However, would it be possible to make this portion (i.e., the spline creation module) work on the GPU?