
Run "mba" in OpenCL #126

Open
skn123 opened this issue May 4, 2014 · 74 comments


skn123 commented May 4, 2014

Currently in mba_benchmark.cpp, we see that the spline creation module runs on the CPU while the interpolation is done on the GPU. This makes sense. However, would it be possible to make this portion (i.e., the spline creation module) work on the GPU as well?


ddemidov commented May 4, 2014

I don't remember all the implementation details right away, but it seems it would be possible to do the setup phase in OpenCL. It could even make sense because all underlying structures are regular.

It will, however, take some time, because I am a bit busy at the moment.


skn123 commented May 4, 2014

Denis, the setup was done using a lot of C++11 code :) That's why I was having a hard time understanding it, but I have a suspicion that it can be done. Let me know if I can help in any way.


ddemidov commented May 4, 2014

I think it should be enough to implement the control lattice structure with OpenCL. See the referenced paper for the details of the algorithm.

The biggest problem is that it needs to be done in a generic way w.r.t. the number of dimensions, so one would need to do some OpenCL code generation.
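Something along these lines (just a rough illustration of runtime code generation with the dimension baked in as a preprocessor constant, not what the actual implementation would look like): a single C++ code path can emit the kernel source for any NDIM. The harder part would be generating the 4^NDIM stencil update in the same way.

#include <sstream>
#include <string>

// Rough sketch: bake the number of dimensions into the kernel source at
// runtime. Shown here for the simple "scale points to the lattice" step only.
std::string lattice_setup_source(unsigned ndim) {
    std::ostringstream src;
    src << "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
        << "#define NDIM " << ndim << "\n"
        << "kernel void scale_points(ulong n, global const double *p,\n"
        << "                         global const double *pmin,\n"
        << "                         global const double *hinv,\n"
        << "                         global double *u)\n"
        << "{\n"
        << "    for (ulong idx = get_global_id(0); idx < n; idx += get_global_size(0))\n"
        << "        for (int d = 0; d < NDIM; ++d)\n"
        << "            u[idx * NDIM + d] = (p[idx * NDIM + d] - pmin[d]) * hinv[d];\n"
        << "}\n";
    return src.str();
}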


skn123 commented May 4, 2014

Why not take the FFT route? Handcraft those kernels for 1, 2 and 3D and make them work on the device for these three dimensions. For all other dimensions, it can continue to use the current host option?


ddemidov commented May 4, 2014

I don't like the idea of keeping separate (but very similar) kernels when they all may be generated from a single source.


skn123 commented May 7, 2014

If you are referring to the basic BA algorithm in the paper, then there are loops that can easily be "OpenMP-fied" even in the current implementation.


ddemidov commented May 7, 2014

Yes. Feel free to provide a pull request :).

ddemidov added a commit that referenced this issue May 9, 2014

ddemidov commented May 9, 2014

So I looked at the mba implementation a bit closer. Now I know why I decided to stay on the CPU for the initialization.

First, and least important, VexCL supports parallel work with multiple compute devices. Since MBA may take a random set of coordinates to get interpolated values at, each device has to hold a complete copy of the control lattice phi. It did not seem right to either duplicate the initialization work for each device, or do initialization on the first device and then transfer the lattice to the rest of the devices.

Second, in the initialization loop over the data points here, the temporary arrays delta and omega are updated at random positions. This could lead to data races with both OpenMP and OpenCL parallelization and would require the use of atomic operations, which would negatively affect performance.
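Schematically, the loop in question does something like this (a simplified 2D sketch with hypothetical flat arrays, not the actual code in mba.hpp); the atomics are what keeps it correct, and also what hurts performance:

#include <cmath>
#include <cstddef>
#include <vector>

// Cubic B-spline basis values at offset t in [0, 1).
static void bspline(double t, double B[4]) {
    double t2 = t * t, t3 = t2 * t;
    B[0] = (1 - t) * (1 - t) * (1 - t) / 6;
    B[1] = (3 * t3 - 6 * t2 + 4) / 6;
    B[2] = (-3 * t3 + 3 * t2 + 3 * t + 1) / 6;
    B[3] = t3 / 6;
}

// delta and omega are flat accumulators over the control lattice (row pitch = ny);
// u, v are data point coordinates already scaled to the unit-step lattice.
// Different points may touch the same lattice node, hence the atomics.
void accumulate(const std::vector<double> &u, const std::vector<double> &v,
                const std::vector<double> &z, std::ptrdiff_t ny,
                std::vector<double> &delta, std::vector<double> &omega)
{
#pragma omp parallel for
    for (std::ptrdiff_t p = 0; p < (std::ptrdiff_t)u.size(); ++p) {
        std::ptrdiff_t i = (std::ptrdiff_t)std::floor(u[p]);
        std::ptrdiff_t j = (std::ptrdiff_t)std::floor(v[p]);

        double Bs[4], Bt[4];
        bspline(u[p] - i, Bs);
        bspline(v[p] - j, Bt);

        double w[4][4], sum_w2 = 0;
        for (int k = 0; k < 4; ++k)
            for (int l = 0; l < 4; ++l) {
                w[k][l] = Bs[k] * Bt[l];
                sum_w2 += w[k][l] * w[k][l];
            }

        for (int k = 0; k < 4; ++k)
            for (int l = 0; l < 4; ++l) {
                double phi = w[k][l] * z[p] / sum_w2;        // per-point coefficient
                std::ptrdiff_t idx = (i + k) * ny + (j + l); // random lattice position
#pragma omp atomic
                delta[idx] += w[k][l] * w[k][l] * phi;
#pragma omp atomic
                omega[idx] += w[k][l] * w[k][l];
            }
    }
}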

For example, take a look at c217b22. Here are the results of mba_benchmark with and without OpenMP parallelization:

No OpenMP

1. Capeverde (AMD Accelerated Parallel Processing)

surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05

[Profile:             8.889 sec.] (100.00%)
[  generate data:     0.047 sec.] (  0.53%)
[  GPU:               3.711 sec.] ( 41.75%)
[    setup:           2.320 sec.] ( 26.10%)
[    interpolate:     1.385 sec.] ( 15.58%)
[  CPU:               5.131 sec.] ( 57.72%)
[    setup:           2.324 sec.] ( 26.14%)
[    interpolate:     2.806 sec.] ( 31.57%)

OpenMP

1. Capeverde (AMD Accelerated Parallel Processing)

surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05

[Profile:            17.936 sec.] (100.00%)
[  generate data:     0.047 sec.] (  0.26%)
[  GPU:              12.662 sec.] ( 70.60%)
[    setup:          11.137 sec.] ( 62.09%)
[    interpolate:     1.519 sec.] (  8.47%)
[  CPU:               5.226 sec.] ( 29.14%)
[    setup:           2.322 sec.] ( 12.95%)
[    interpolate:     2.903 sec.] ( 16.19%)

Note that setup now takes 11 seconds instead of 2.3.

So unless I did something wrong here, it seems it's better to leave the current MBA setup as it is.


ddemidov commented May 9, 2014

Ok, I did something wrong here. After replacing the critical section with an atomic in ecad92c, the mba_benchmark output looks like this:

1. Capeverde (AMD Accelerated Parallel Processing)

surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05

[Profile:             8.834 sec.] (100.00%)
[  generate data:     0.046 sec.] (  0.52%)
[  GPU:               3.642 sec.] ( 41.23%)
[    setup:           2.251 sec.] ( 25.48%)
[    interpolate:     1.386 sec.] ( 15.69%)
[  CPU:               5.145 sec.] ( 58.25%)
[    setup:           2.327 sec.] ( 26.34%)
[    interpolate:     2.818 sec.] ( 31.90%)

This is better than the serial version, but only slightly. It also introduces the requirement that the iterators to the coordinates and values of the data points point at contiguous chunks of memory. I am not sure the negligible speedup is worth it. What do you think?


skn123 commented May 10, 2014

If I look at the BA algorithm in the paper, there are three main loops.
If I understand correctly, the innermost loop can be parallelized using Boost.Compute (or even Bolt), as it is potentially data-parallel.
The third loop, where we eventually compute \phi, is parallelizable anyway.
If we assume data parallelism, then, barring the boundary points, \delta_{i+k, j+l} can be computed locally. There won't be any need for inter-data communication except at the boundary locations. \delta and \omega at these locations can be handled serially once the inner locations have been computed.
Your other concern regarding multi-device operation is valid. Hence, a mechanism should be available to use this parallel version (either as an ifdef or some user-specified input). That being said, anything that utilizes any form of parallelization (CPU and GPU) is always welcome.
Do you think I am correct in my assumptions?


skn123 commented May 10, 2014

I have another question regarding mba_benchmark.cpp:

    vex::multivector<double, 2> C(ctx, n);
    vex::vector<double> Z(ctx, n);

    vex::copy(x, C(0));
    vex::copy(y, C(1));

    prof.tic_cl("interpolate");
    for(size_t i = 0; i < m; ++i)
        Z = surf(C(0), C(1));

Now, is this the only way to copy x and y to the device? Would it be possible to assign values to C(0) and C(1) directly, without having to first allocate them on the host?

@ddemidov

There are no host-allocated structures in the snippet you provided. Both vex::multivector and vex::vector are device-side structures, and those are allocated directly.

If what you meant to ask is whether it's possible to initialize a newly created device vector with host data, then the answer is yes, it is possible. See the list of vex::vector constructors.
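For example, assuming x is a std::vector<double> holding the host-side data:

    std::vector<double> x(n, 42.0);    // host data
    vex::vector<double> X(ctx, x);     // allocates device memory and copies x in one go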

The mba_benchmark does look a bit ugly in this regard. 56e236a fixes that.

@ddemidov

Regarding the MBA algorithm: if you have a closer look at the BA algorithm on page 4 of the paper, you'll notice that there is an outer loop over the scattered data points (for each point (x_c, y_c, z_c) in P do), which updates the accumulator arrays delta and omega at random locations on each iteration. This is the problem I was talking about earlier: it would lead to a data race and would require the use of atomic operations, which could easily kill the performance gains of parallelization.


skn123 commented May 10, 2014

Yes, indeed, that is the case with (for each point (x_c, y_c, z_c) in P do), and that is what I meant by "data parallelism". If I were to use a CPU as an example, I would divide the data into k parts (where k is the # of cores) and run BA on each of the control points within each part, except for the boundary points. This is a classic case of parallelizing region-merging algorithms using the "union-find" method. If you do it using atomics, it will indeed kill the performance.
The advantage of spline interpolation is that it only requires points from neighbors (1st, 2nd, 3rd, etc.) and not all the regions. I don't know how to go about doing data parallelism on a GPU :)

Now, for your previous comment:
Yes, the mba_benchmark looks much cleaner now. However, my question still remains unanswered. Let me explain my situation. I have a structured image grid having dimensions inSize[0] * inSize[1] * K. I would like to fill up x, y and z with the respective indices.
If I were to do this using OpenMP, I would do it the following way:
#pragma omp parallel for
for (size_t i = 0; i < n; ++i)
{
    x[i] = (float)(i % inSize[0]);
    y[i] = (float)(((i - inSize[0]) / inSize[0]) % inSize[1]);
    z[i] = (float)mSortedBands[((i - inSize[0]) / inSize[0] - inSize[1]) / inSize[1]];
}

where mSortedBands contains index values.
Ideally, I should be able to do the whole thing in a kernel. In that case, I would not have to initialize the x, y or z values on the host and could go directly to the device.

Any pointers on how one can do this in VexCL? Maybe this is the right time to write my first OpenCL (or should I say, VexCL) kernel :)

Also, why are we not freeing the x, y vectors in mba_benchmark? How do the device vectors get deallocated? Are they smart pointers?

@ddemidov

Ok, the question about data allocation is a lot clearer now. You could do this:

vex::vector<float> x(ctx, n), y(ctx, n), z(ctx, n);
auto i = vex::element_index();
x = i % inSize[0];
y = ((i - inSize[0]) / inSize[0]) % inSize[1];

Not sure about mSortedBands, but if it was a vex::vector<int>, you could do

z = vex::permutation(((i - inSize[0]) / inSize[0] - inSize[1]) / inSize[1])(mSortedBands);

@ddemidov

Regarding a K-way split of the input data, how would you do it on the CPU? Would each core skip points that do not belong to its subdomain? Or would you do a sort-by-key first, where the key is the subdomain each point belongs to? It seems that on a GPU only the second of these options would make sense, but then it has worse algorithmic complexity than the original operation.


skn123 commented May 10, 2014

Perfect, then the only thing that would be of interest in this implementation would be the aspect of data parallelization. To understand this concept, take a look at Fig 3 in this paper:
http://crd.lbl.gov/assets/pubs_presos/Harrison-EGPGV2011.pdf

@ddemidov

Regarding the deallocation: vex::vectors behave in the same way std::vectors do. They deallocate themselves when going out of scope. That's an example of the RAII idiom.
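For example:

    {
        vex::vector<double> tmp(ctx, n); // device memory is allocated here
        // ... use tmp ...
    }   // tmp goes out of scope here and its device buffers are released automatically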


skn123 commented May 10, 2014

So in mba_benchmark, do I have to explicitly call p.clear(), v.clear(), and so on?

@ddemidov

No, you just let them go out of scope. No memory will leak.

@ddemidov

The technique described in the paper by Harrison et al (and domain decomposition in general) is suitable for fat cluster nodes or CPU cores. This is an example of coarse-grained parallelism. GPUs, on the other hand, use fine-grained parallelism, where each thread is assigned a single data point (e.g. a matrix or vector element). So I don't think this approach could be used here.


skn123 commented May 11, 2014

Thanks for the terms :) I always wondered why we cannot use a GPU for coarse-grained parallelism. However, the parallelism I am hinting at can (??) be achieved by interleaving. As indicated in Sec 3.2:
"...Since each data point in P influences a set of 4 x 4 neighboring control points..."
I need to ensure that, when running the outer loop, I am only updating those control points which are not influenced by neighbors. The only way this would be possible is to use a multi-grid mechanism. Let us assume I have points x \in [0, M-1], y \in [0, N-1];
then
Grid 1: x_1 \in [0 : 4 : M-1], y_1 \in [0 : 4 : N-1], ...and so on for each dimension
Grid 2: x_2 \in [1 : 4 : M-1], y_2 \in [1 : 4 : N-1], ...and so on for each dimension
Grid 3: x_3 \in [2 : 4 : M-1], y_3 \in [2 : 4 : N-1], ...and so on for each dimension
Grid 4: x_4 \in [3 : 4 : M-1], y_4 \in [3 : 4 : N-1], ...and so on for each dimension
All the other operations within the main loop are localized (i.e., no communication between processes).
In your terms, I can (??) run a fine-grained parallelization for each Grid_i and then sum them all up (because that is what is actually happening in the innermost loop).
What do you think? Have I interpreted this correctly?
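To make the idea concrete, here is an untested OpenMP sketch (2D only, coordinates assumed already scaled to a unit-step lattice and non-negative; update_point() is a hypothetical helper doing the per-point 4 x 4 update from the paper, without atomics):

#include <array>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper: the per-point 4x4 update of delta and omega from the
// paper, with no atomics.
void update_point(size_t p, const std::vector<double> &u, const std::vector<double> &v,
                  const std::vector<double> &z, int ny,
                  std::vector<double> &delta, std::vector<double> &omega);

// Points are grouped by the lattice cell they fall into. Cells whose indices
// are congruent modulo 4 in both directions have non-overlapping 4x4 stencils,
// so within one of the 16 "colors" the cells can be processed in parallel
// without atomics; the points inside a single cell are handled serially by the
// thread that owns that cell.
void accumulate_interleaved(const std::vector<double> &u, const std::vector<double> &v,
                            const std::vector<double> &z, int ny,
                            std::vector<double> &delta, std::vector<double> &omega)
{
    std::array<std::map<std::pair<int,int>, std::vector<size_t>>, 16> colors;
    for (size_t p = 0; p < u.size(); ++p) {
        int i = (int)std::floor(u[p]), j = (int)std::floor(v[p]);
        colors[(i % 4) * 4 + (j % 4)][{i, j}].push_back(p);
    }

    for (const auto &cells : colors) {
        // Flatten the map so OpenMP can iterate over it with an index.
        std::vector<const std::vector<size_t>*> work;
        for (const auto &c : cells) work.push_back(&c.second);

#pragma omp parallel for
        for (std::ptrdiff_t w = 0; w < (std::ptrdiff_t)work.size(); ++w)
            for (size_t p : *work[w])            // all points of one cell,
                update_point(p, u, v, z, ny,     // handled by a single thread
                             delta, omega);
    }
}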

@ddemidov

Could you provide a working OpenMP prototype for your idea? Just the main loop over the data points on page 4, with some random input.


skn123 commented May 15, 2014

Denis
I had an even closer look at the paper (after all this), as I was convinced of what I mentioned. Take a look at the last paragraph of page 3 in the paper (right-hand column): "...In general, we resolve multiple assignments...". Every control point \phi is dependent on the data points in its 4 x 4 neighborhood. Let's take the analogy of median filtering (5 x 5), which, as we know, is embarrassingly parallel.
Now, to check this, I browsed the web and came across this codebase on GitHub:
Now, to check this, I browsed through the web and came up with this codebase that is there on github:
https://github.com/SINTEF-Geometry/MBA

Check the functions:

void MBA::BAalg()
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)

and

inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2) 

I have extracted the relevant code pertaining to the BA algorithm only. We are not assuming UNIFORM_CUBIC_C1_SPLINES:

for (int ip = 0; ip < noPoints; ip++)
{
    // Map to the half-open domain Omega = [0,m) x [0,n).
    // The mapped uc and vc must be (strictly) less than m and n respectively.
    double uc = (data_.U(ip) - data_.umin()) * interval_normalization_factor_u;
    double vc = (data_.V(ip) - data_.vmin()) * interval_normalization_factor_v;

    int i, j;
    double s, t;
    UCBspl::ijst(m_, n_, uc, vc, i, j, s, t);

    // compute w_kl's and SumSum w_ab^2 here:
    double w_kl[4][4];
    int k, l;
    double sum_w_ab2_inv = 0.0;
    UCBspl::WKLandSum2(s, t, w_kl, sum_w_ab2_inv);

    sum_w_ab2_inv = double(1) / sum_w_ab2_inv;

    double zc = data_.Z()[ip];

    // check p. 231: k = (i+1) - floor(xc) and l = ...
    for (k = 0; k <= 3; k++)
    {
        for (l = 0; l <= 3; l++)
        {
            // compute phi_kl with equation (3)
            double tmp = w_kl[k][l];

            // 1. Originally
            double phi_kl = tmp * zc * sum_w_ab2_inv;

            // 2. Alternatively, to let it taper off more smoothly (but more efficient if permanently)
            //double t = 0.8; double phi_kl = (1.0-t)*tmp*zc/sum_w_ab2 + t*zc;

            // 3. Alternatively, with linear term
            //double alpha = 0.01; double phi_kl = (tmp*zc + alpha*(tmp - sum_w_ab2)) / sum_w_ab2;

            // And alternatively for equation (5):
            // from |w_kl|^2 to |w_kl| to get a weighted average
            // just skip the next statement

            tmp *= tmp;

            delta_(i+k, j+l) += tmp * phi_kl;
            omega_(i+k, j+l) += tmp;
        }
    }
}

// s,t \in [0,1) (but special on gridlines m and n)
// i,j \in [-1, ???
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t) 
{
  //int i = std::min((int)uc - 1, m-2);
  //int j = std::min((int)vc - 1, n-2);

#ifdef  UNIFORM_CUBIC_C1_SPLINES
  i = 2*((int)uc) - 1;
  j = 2*((int)vc) - 1;
#else
  i = (int)uc - 1;
  j = (int)vc - 1;
#endif

  s = uc - floor(uc);
  t = vc - floor(vc);

  // adjust for x or y on gridlines m and n (since impl. has 0 <= x <= m and 0 <= y <= n 
#ifdef  UNIFORM_CUBIC_C1_SPLINES
  if (i == 2*m-1) {
    i-=2;
    s = 1;
  }
  if (j == 2*n-1) {
    j-=2;
    t = 1;
  }
#else
  if (i == m-1) {
    i--;
    s = 1;
  }
  if (j == n-1) {
    j--;
    t = 1;
  }
#endif
}

inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2) 
{
    sum_w_ab2 = 0.0;
    double Bs0 = B_0(s); double Bt0 = B_0(t);
    double Bs1 = B_1(s); double Bt1 = B_1(t);
    double Bs2 = B_2(s); double Bt2 = B_2(t);
    double Bs3 = B_3(s); double Bt3 = B_3(t);

    double tmp;

    // unrolled by Odd Andersen 15. dec. 2003, for optimization
    tmp = Bs0 * Bt0; w_kl[0][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs0 * Bt1; w_kl[0][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs0 * Bt2; w_kl[0][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs0 * Bt3; w_kl[0][3] = tmp; sum_w_ab2 += tmp * tmp;

    tmp = Bs1 * Bt0; w_kl[1][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs1 * Bt1; w_kl[1][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs1 * Bt2; w_kl[1][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs1 * Bt3; w_kl[1][3] = tmp; sum_w_ab2 += tmp * tmp;

    tmp = Bs2 * Bt0; w_kl[2][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs2 * Bt1; w_kl[2][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs2 * Bt2; w_kl[2][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs2 * Bt3; w_kl[2][3] = tmp; sum_w_ab2 += tmp * tmp;

    tmp = Bs3 * Bt0; w_kl[3][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs3 * Bt1; w_kl[3][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs3 * Bt2; w_kl[3][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs3 * Bt3; w_kl[3][3] = tmp; sum_w_ab2 += tmp * tmp;

//     int k,l;
//     sum_w_ab2 = 0.0; 
//     for (k = 0; k <= 3; k++) {
//  for (l = 0; l <=3; l++) {
//      double tmp = w(k, l, s, t);
//      w_kl[k][l] = tmp;
//      sum_w_ab2 += (tmp*tmp); 
//  }
//     }
}

@ddemidov

I was not saying that you are wrong; I merely asked for a working prototype of your parallelization idea, to see what effect it would have on performance and memory footprint. I don't see it in the code snippet you provided.

Cheers,
Denis


skn123 commented May 15, 2014

I will take a crack at it. I would not, however, be able to translate it to VexCL, but I think I can do it using OpenMP.

@ddemidov

An OpenMP implementation should be enough. Also, you don't need to implement the full BA algorithm; a parallelization of the loop on p. 4 is enough.


skn123 commented May 15, 2014

Meanwhile, let me try this with OpenMP.


skn123 commented May 16, 2014

Here is the timing when using the current version of mba_benchmark:

E:\Binaries_MinGW\vexcl\examples>mba_benchmark.exe
size : 33554432

  1. Cayman (AMD Accelerated Parallel Processing)
  2. Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz (AMD Accelerated Parallel Processing)

^C
E:\Binaries_MinGW\vexcl\examples>mba_benchmark.exe
size : 33554432

  1. Cayman (AMD Accelerated Parallel Processing)
  2. Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz (AMD Accelerated Parallel Processing)

surf(0.5, 0.5, 0.5) = -0.000214423

Profile:          1105.916 sec.
  generate data:     3.840 sec.
  GPU:            1102.076 sec.
    setup:        1098.016 sec.
    interpolate:     3.530 sec.

E:\Binaries_MinGW\vexcl\examples>

The data size is 16 * 2048 * 1024.


skn123 commented May 16, 2014

So it would be interesting to see if the setup time can be reduced. I am working on the OpenMP implementation, but it will be different from your C++11-based code.

@ddemidov

But that's the timing for VexCL's benchmark. I was more interested in timings for your own problem (I was secretly hoping that setup takes a negligible fraction of time for your use case).


skn123 commented May 16, 2014

My problem is a little bit more involved. First of all, a couple of questions:
a.) In the original paper, we have the variables s, t, which I presume lie between 0 and 1.
b.) If that is the case, then we see that they are computed from x and y as s = x_c - \lfloor x_c \rfloor. How would that be possible if we have integer grid points? I can get away with it for i, j (in \phi_{i,j}), but for s, t I would need something else. That is what I am trying to figure out. Any thoughts?

@ddemidov

The grid has non-unit spacing, so some scaling is done before computing s and t. By the way, those are stored in the array s, because mba is not restricted to two dimensions. The computation is performed by these lines.


skn123 commented May 16, 2014

But suppose my grid is indeed integer; let's say, in my case, [0..15] X [0..2047] X [0..1023]. Then s[d], where d = [0, 1, ..., NDIM], would be [0, 1/16, 2/16, ..., 15/16] X [0, 1/2048, 2/2048, ..., 2047/2048] X [0, 1/1024, 2/1024, ..., 1023/1024].
Is that correct?

@ddemidov

I don't understand that. s is computed for each data point and is equal to the fraction of the grid step separating the point from the grid line to its left.

[figure: illustration of s and t]


skn123 commented May 16, 2014

And this is what I am not understanding. So, if my grid is uniform, how could one build those numbers for s[d]? I think that is the only problem I am facing now.

@ddemidov

Look again at the lines I quoted before. p contains the real coordinates of the data points, the grid goes from pmin to pmax, and hinv contains the inverted grid step for each dimension.

The first line scales p so that it becomes aligned with a unit-step grid. Then s is computed as u - floor(u).

In case the grid has integer coordinates (note that this won't be the case on subsequent hierarchy levels anyway), hinv[d] == 1 and pmin[d] == 0.
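In code, the idea for a single dimension is simply this (a sketch, not the exact source):

#include <cmath>
#include <cstddef>

// x    -- coordinate of the data point,
// xmin -- lower bound of the grid, hinv -- inverted grid step.
inline void scale_point(double x, double xmin, double hinv,
                        std::ptrdiff_t &i, double &s)
{
    double u = (x - xmin) * hinv;          // align with a unit-step grid
    i = (std::ptrdiff_t)std::floor(u) - 1; // leftmost control point of the stencil
    s = u - std::floor(u);                 // fractional offset, 0 <= s < 1
}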


skn123 commented May 16, 2014

Yes, I did see those lines. The problem is that the grid you provide in mba_benchmark is also between [0, 1] (when you call the grid creation module). So is it mandatory for me to normalize the grid values to [0, 1]? That is why I asked the question: should I normalize my actual grid coordinates to lie between 0 and 1 to conform with the mba architecture, or can it work with integer grid locations?

@ddemidov

The grid in the benchmark is just an example. [0, 1] looked like a nice interval to me. It could be [-7.6, 42] or [0, 1000].


skn123 commented May 16, 2014

One more question here: in your mba_benchmark, you assume that all points in the grid need to be interpolated, whereby all points in the grid form part of the lattice. Is this the mandatory way of implementing this? I think the # of control points should (generally) be much smaller than the actual # of points on the grid.

@ddemidov

There are no restrictions on the structure or number of control points or interpolation points. The benchmark is just a benchmark; it is there to stress-test the algorithm.

@ddemidov ddemidov added the todo label May 16, 2014
@ddemidov

Note to self: each data point updates m = 4^ndim spline coefficients. In case there are few control points, this could be done by allocating m * npoints of temporary storage, then doing a reduce_by_key after the parallel updates (where the key is the id of the spline coefficient).
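Schematically (a CPU illustration of the scheme with hypothetical types; on the device this would map to a sort_by_key/reduce_by_key pair):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Contribution {
    size_t key;   // flat index of the spline coefficient being updated
    double delta; // contribution to delta at that coefficient
    double omega; // contribution to omega at that coefficient
};

// Sketch: every data point has already been expanded into its m = 4^ndim
// contributions; sort them by coefficient id, then reduce runs of equal keys.
void reduce_contributions(std::vector<Contribution> &c,
                          std::vector<double> &delta, std::vector<double> &omega)
{
    std::sort(c.begin(), c.end(),
              [](const Contribution &a, const Contribution &b) { return a.key < b.key; });

    for (size_t i = 0; i < c.size(); ) {
        size_t j = i;
        double d = 0, w = 0;
        for (; j < c.size() && c[j].key == c[i].key; ++j) {
            d += c[j].delta;
            w += c[j].omega;
        }
        delta[c[i].key] += d;
        omega[c[i].key] += w;
        i = j;
    }
}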


skn123 commented Jun 21, 2014

I am facing a problem running vex::mba in parallel threads. The crash happens when I am doing Z = surf(C(0), C(1)). This means that I cannot run concurrent interpolations on the GPU from multiple host threads. In that case, if I were to serialize this, I would need to save the spline coefficients. However, the API that is provided for MBA does not allow me to do this. Is there a way out of this conundrum?
On a separate note, the algorithm does not scale up to 3D elements and the results are coming out incorrect.

@ddemidov

Running interpolation on the same GPU from multiple threads does not make sense performance-wise, since OpenCL will serialize the kernels anyway. Moreover, it is not safe (see the notes about thread safety in the OpenCL specification), so you should guard calls to mba::operator().
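For example (a sketch; one mutex shared by all host threads that use the device):

#include <mutex>
#include <vexcl/vexcl.hpp>
#include <vexcl/mba.hpp>

std::mutex gpu_mutex; // shared by every host thread that touches the device

// surf is the vex::mba instance from the snippets above.
void interpolate_guarded(vex::mba<2, double> &surf,
                         vex::multivector<double, 2> &C, vex::vector<double> &Z)
{
    std::lock_guard<std::mutex> lock(gpu_mutex);
    Z = surf(C(0), C(1)); // calls to mba::operator() are now serialized across threads
}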

Applying an instance of MBA on a GPU other than the one it was constructed for is not supported.

Regarding the incorrect results in 3D: I have mostly used mba for 3D problems, and it worked for me. Could you please provide a simple test case?


skn123 commented Jun 21, 2014

So, that's why I said I will serialize the interpolation operation. However, as the MBA algorithm runs on the CPU, I can potentially parallelize it. For that, I would need to create a std::vector of vex::mba operators. However, there is no API to create an empty instance of vex::mba.

I will provide you with a set of 3D control points as well as test points.

@ddemidov

Only the constructor of MBA runs on the CPU. The interpolation itself runs on the GPU (or, more generally, on a compute device).

You could use a std::vector< std::shared_ptr<vex::mba> >, or a boost::ptr_vector<vex::mba>. The second option has the advantage that it provides reference semantics (so that you could write mba[i](X, Y) instead of (*mba[i])(X, Y)).


skn123 commented Jun 21, 2014

Maybe this should be documented somewhere in the code.

@ddemidov

What exactly? The use of smart pointers with a class that does not provide a default constructor?


skn123 commented Jun 21, 2014

Yes, especially for this example. I am familiar with std::shared_ptr; I will have to look into boost::ptr_vector and see if that is more beneficial.


skn123 commented Jun 22, 2014

I tried it with std::shared_ptr and it apparently worked. However, I think the vex::mba is being overwritten.

// This defines a context to be used by VexCL
static vex::Context ctx( vex::Filter::Env );
// A typedef for the routine to be called from vexcl
typedef vex::mba<2, double> mba_type;
// A vector of such mba modules
static std::vector< std::shared_ptr< mba_type > > mba_vector(cBands_Num, NULL);

and the calling loops are:

//#ifdef USE_OPENMP
//#pragma omp parallel for
//#endif
  for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
  {
     ....//The image to be interpolated pImage
    // 4.) Run the MBA pipeline on this slice
    MBA_BuildSplineLattice(pImage, h_Rows, w_Cols, sIdx);
  }

  // In this loop we use the spline lattices to interpolate missing pixels. This
  // step is done serially as the interpolation is effected on the GPU.
  for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
  {
    std::cout << "Interpolate: "<<cBands[sIdx] << std::endl;
    // 5.) Run the MBA pipeline on this slice
    MBA_Interpolate(pImage, h_Rows, w_Cols, sIdx);
    .....//The interpolated image
  }

and the place where I build the spline lattice

....
  //////////////////////////////////////////////////////////////////////////////
  //From this point, we interface it with vexcl/mba.hpp
  mba_vector[sIdx] = std::make_shared<mba_type>(ctx,
                                             make_array2<double>(-0.01, -0.01),
                                             make_array2<double>(1.01, 1.01),
                                             p, v, make_array2<size_t>(2, 2));

I can compute the spline lattice for all slices in parallel and then do the interpolation serially.

@ddemidov

This looks mostly valid. Why do you make ctx and especially mba_vector static? Is this inside a function that gets called several times? Can you guarantee that cBands_Num won't change between calls?

Also, what do you mean by 'vex::mba is being overwritten'?


skn123 commented Jun 22, 2014

Static is because I do not want errors related to "multiple definitions". Now, I did it the other way:

for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
  {
     ....//The image to be interpolated pImage
    // 4.) Run the MBA pipeline on this slice
    MBA_BuildSplineLattice(pImage, h_Rows, w_Cols, sIdx);
  //}

  // In this loop we use the spline lattices to interpolate missing pixels. This
  // step is done serially as the interpolation is effected on the GPU.
 // for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
  //{
    std::cout << "Interpolate: "<<cBands[sIdx] << std::endl;
    // 5.) Run the MBA pipeline on this slice
    MBA_Interpolate(pImage, h_Rows, w_Cols, sIdx);
    .....//The interpolated image
  }

and it works like a breeze! Can you see if you can reproduce this at your end? I think there is some issue with storing previously generated lattices. Yes, all the other factors that you have mentioned are stable.
I even tried this after removing the OpenMP directive and it still doesn't work. The only way it works is if I run the interpolation right after building the lattice.

@skn123

skn123 commented Jun 22, 2014

Ignore these comments. My mistake: I was not updating the slice to be interpolated, and it stayed on the last slice after the build-lattice loop :)

@ddemidov

Could you write a bit more clearly what does and what does not work?


skn123 commented Jun 22, 2014

The variable "pImage" in the code above corresponds to an image slice. I was not updating that in the interpolation phase. Now, would / can there be a way to run interpolation for multiple spline lattices and multiple C(0) and C(1) rather than having to do it the serial way?

@ddemidov

As I said before, applying several instances of vex::mba from several host threads on the same compute device(s) does not make sense and is not safe. The OpenCL implementation will serialize the compute kernels anyway. I think you cannot get better than what you have now: parallel construction of several interpolators with multiple host threads, and serial application of the interpolators on the GPU.
