Run "mba" in OpenCL #126
I don't remember all the implementation details right away, but it seems it would be possible to do the setup phase in OpenCL. It could even make sense because all underlying structures are regular. It will, however, take some time, because I am a bit busy at the moment. |
Denis, the setup was done using a lot of C++11 code :) That's why I was having a hard time understanding it, but I have a suspicion that it can be done. Let me know if I can help in any way. |
I think it should be enough to implement the control lattice structure with OpenCL. See the referenced paper for the details of the algorithm. The biggest problem is that it needs to be done in a generic way w.r.t. the number of dimensions, so one would need to do some OpenCL code generation.
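Just to illustrate what I mean by code generation (a very rough sketch, not VexCL code; the kernel name, argument list, and stubbed body are hypothetical): the host builds the kernel source as a string parameterized by the number of dimensions and compiles it at run time.
#include <sstream>
#include <string>

// Emit an OpenCL kernel that takes one coordinate array per dimension plus the
// data values; the actual body of the setup step is left as a stub.
std::string make_setup_kernel_source(int ndim) {
    std::ostringstream src;
    src << "kernel void setup_lattice(\n";
    for (int d = 0; d < ndim; ++d)
        src << "    global const double *coord" << d << ",\n";
    src << "    global const double *val,\n"
        << "    global double *delta,\n"
        << "    global double *omega,\n"
        << "    ulong npoints\n"
        << ")\n{\n"
        << "    size_t idx = get_global_id(0);\n"
        << "    if (idx >= npoints) return;\n"
        << "    /* compute the 4^" << ndim << " tensor-product B-spline weights\n"
        << "       for point idx and scatter them into delta and omega */\n"
        << "}\n";
    return src.str();
}
|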
Why not take the FFT route? Handcraft those kernels for 1, 2, and 3D and make them work on the device for these three dimensions. For all other dimensions, could it continue to use the current host option? |
I don't like the idea of keeping separate (but very similar) kernels when they all may be generated from a single source. |
If you are referring to the basic BA algorithm in the paper, then there are loops that can easily be "OpenMP-fied" even in the current implementation. |
Yes. Feel free to provide a pull request :). |
So I looked at the mba implementation a bit closer. Now I know why I decided to stay on the CPU for the initialization. First, and least important, VexCL supports parallel work with multiple compute devices. Since MBA may take a random set of coordinates to get interpolated values at, each device has to hold a complete copy of the control lattice. Second, in the initialization loop over the data points the temporary arrays (delta and omega) are updated from different data points, so those updates have to be synchronized once the loop is parallelized. For example, take a look at c217b22. Comparing the runs without and with OpenMP, setup now takes 11 seconds instead of 2.3. So unless I did something wrong here, it seems it's better to leave the current MBA setup as it is. |
Ok, I did something wrong here. After replacing the critical section with an atomic update in ecad92c, the result is better than the serial version, but only slightly. This also introduces the requirement that the iterators to the coordinates and values of the data points point to contiguous chunks of memory. I am not sure the negligible speedup is worth it. What do you think?
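For reference, here is a minimal, self-contained sketch of the atomic-based scatter for the 2D case (this is not the code from ecad92c; the function name, the array layout, and the assumption that u and v are already mapped into [0, m) x [0, n) are mine):
#include <cmath>
#include <vector>

// Standard uniform cubic B-spline basis functions.
static double B0(double s) { double r = 1 - s; return r * r * r / 6; }
static double B1(double s) { return (3 * s * s * s - 6 * s * s + 4) / 6; }
static double B2(double s) { return (-3 * s * s * s + 3 * s * s + 3 * s + 1) / 6; }
static double B3(double s) { return s * s * s / 6; }

// Scatter phase of the BA algorithm: each data point (u, v, z), with u in [0, m)
// and v in [0, n), updates a 4x4 patch of the (m+3) x (n+3) control lattice.
// Several points may hit the same lattice entry, hence the atomic updates.
void ba_scatter(const std::vector<double> &u, const std::vector<double> &v,
                const std::vector<double> &z, int m, int n,
                std::vector<double> &delta, std::vector<double> &omega)
{
    const int stride = n + 3; // delta and omega hold (m+3) * (n+3) entries, row-major
    const int np = static_cast<int>(u.size());

    #pragma omp parallel for
    for (int ip = 0; ip < np; ++ip) {
        const int    i = static_cast<int>(u[ip]) - 1;
        const int    j = static_cast<int>(v[ip]) - 1;
        const double s = u[ip] - std::floor(u[ip]);
        const double t = v[ip] - std::floor(v[ip]);

        const double Bs[4] = { B0(s), B1(s), B2(s), B3(s) };
        const double Bt[4] = { B0(t), B1(t), B2(t), B3(t) };

        double w[4][4], sum_w2 = 0;
        for (int k = 0; k < 4; ++k)
            for (int l = 0; l < 4; ++l) {
                w[k][l] = Bs[k] * Bt[l];
                sum_w2 += w[k][l] * w[k][l];
            }

        for (int k = 0; k < 4; ++k)
            for (int l = 0; l < 4; ++l) {
                const double phi = w[k][l] * z[ip] / sum_w2; // equation (3) of the paper
                const double w2  = w[k][l] * w[k][l];
                const int    idx = (i + k + 1) * stride + (j + l + 1); // +1: lattice index -1 maps to array row 0
                #pragma omp atomic
                delta[idx] += w2 * phi;
                #pragma omp atomic
                omega[idx] += w2;
            }
    }
}
|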
If I look at the BA algorithm in the paper, there are three main loops. |
I have another question regarding mba_benchmark.cpp:
Now, is this the only way to copy x and y to the device? Would it be possible to allocate values to C(0) and C(1) directly without having to first allocate them on the host? |
There are no host-allocated structures in the snippet you provided. Both vectors already live on the device. If what you meant to ask is whether it is possible to initialize a newly created device vector with host data, then the answer is yes, it is possible. See the list of vex::vector constructors. The mba_benchmark does look a bit ugly in this regard; 56e236a fixes that.
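For instance, a minimal sketch (assuming an initialized vex::Context named ctx and a problem size n):
std::vector<double> host_x(n, 1.0);  // data prepared on the host
vex::vector<double> x(ctx, host_x);  // device vector initialized from the host data
vex::vector<double> y(ctx, n);       // device vector with no host counterpart
y = 2 * x;                           // evaluated on the compute device
vex::copy(y, host_x);                // read the result back to the host when needed
|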
Regarding the MBA algorithm, if you have a closer look at |
Yes, indeed that is the case of (for each point (x_c, y_c, z_c) in P do), and that is what I meant by "data parallelism". If I were to use a CPU as an example, I would divide the data into k parts (where k is the number of cores) and run BA on the control points within each part, except for the boundary points. This is a classic case of parallelizing region-merging algorithms using the "union-find" method. If you do it with atomics, indeed it will kill the performance. Now, for your previous comment: in my case I need to fill the coordinate vectors on the device, where mSortedBands contains the index values. Any pointers on how one can do this in VexCL? Maybe this is the right time to write my first OpenCL (or should I say, VexCL) kernel :) Also, why are you not freeing the x, y vectors in mba_benchmark? How do the device vectors get deallocated? Are they smart pointers? |
Ok, the question about data allocation is a lot clearer now. You could do this:
vex::vector<float> x(ctx, n), y(ctx, n), z(ctx, n);
auto i = vex::element_index();
x = i % inSize[0];
y = (i / inSize[0]) % inSize[1];
Not sure about the last one, but something along these lines should work:
z = vex::permutation(i / (inSize[0] * inSize[1]))(mSortedBands);
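Here vex::element_index() yields the linear index 0, 1, 2, ... of each element, and the modulo/division expressions decode the per-dimension coordinates from it, so the coordinate vectors are filled entirely on the device without any host-side staging arrays.
|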
Regarding a K-way split of the input data, how would you do it on a CPU? Would each core skip points that do not belong to its subdomain? Or would you do a sort-by-key first, where the key is the subdomain each point belongs to? It seems that on a GPU only the second of these options would make sense, but then it has worse algorithmic complexity than the original operation. |
Perfect, then the only thing that would be of interest in this implementation would be the aspect of data parallelization. To understand this concept, take a look at Fig 3 in this paper: |
Regarding the deallocation: vex::vector is an RAII class; its device memory is released automatically in the destructor.
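In other words, a trivial sketch (ctx and n as in the benchmark):
{
    vex::vector<double> tmp(ctx, n); // device memory is allocated here
    // ... use tmp in expressions ...
}   // tmp goes out of scope here; its device memory is released automatically
|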
So in the mba_benchmark do I explicitly have to state p.clear(), v.clear() and so on...? |
No, you just let them go out of scope. No memory will leak. |
The technique described in the paper by Harrison et al. (and domain decomposition in general) is suitable for fat cluster nodes or CPU cores. This is an example of coarse-grain parallelism. GPUs, on the other hand, have fine-grained parallelism, where each thread is assigned to a single data point (e.g. a matrix or vector element). So I don't think this approach could be used here. |
Thanks for the terms :) I always wondered why we cannot use a GPU for coarse-grained parallelism. However, the parallelism I am hinting at can (??) be achieved by the process of interleaving. As indicated in Sec 3.2 |
Could you provide a working OpenMP prototype for your idea? Just for the main loop over data points on page 4, with some random input. |
Denis, check the functions:
void MBA::BAalg()
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)
inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2)
I have extracted all the relevant code pertaining to the BA algorithm only. We are not assuming UNIFORM_CUBIC_C1_SPLINES.
for (int ip = 0; ip < noPoints; ip++)
{
// Map to the half open domain Omega = [0,m) x [0,n)
// The mapped uc and vc must be (strictly) less than m and n respectively
double uc = (data_.U(ip) - data_.umin()) * interval_normalization_factor_u;
double vc = (data_.V(ip) - data_.vmin()) * interval_normalization_factor_v;
int i, j;
double s, t;
UCBspl::ijst(m_, n_, uc, vc, i, j, s, t);
// compute w_kl's and SumSum w_ab^2 here:
double w_kl[4][4];
int k,l;
double sum_w_ab2_inv = 0.0;
UCBspl::WKLandSum2(s, t, w_kl, sum_w_ab2_inv);
sum_w_ab2_inv = double(1) / sum_w_ab2_inv;
double zc = data_.Z()[ip];
// check p. 231: k=(i+1) - flor(xc) and l = ...
for (k = 0; k <= 3; k++)
{
for (l = 0; l <=3; l++)
{
// compute phi_kl with equation (3)
double tmp = w_kl[k][l];
// 1. Originally
double phi_kl = tmp * zc * sum_w_ab2_inv;
// 2. Alternatively, to let it tapper of more smoothly (but more efficient if permantly)
//double t = 0.8; double phi_kl = (1.0-t)*tmp*zc/sum_w_ab2 + t*zc;
// 3. Alternatively, with linear term
// double alpha = 0.01; double phi_kl = (tmp*zc + alpha*(tmp - sum_w_ab2)) / sum_w_ab2;
// And alternatively for equation (5):
// from |w_kl|^2 to |w_kl| to get a weighted average
// just skip the next statement
tmp *= tmp;
delta_(i+k,j+l) += tmp*phi_kl;
omega_(i+k,j+l) += tmp;
}
}
}
// s,t \in [0,1) (but special on gridlines m and n)
// i,j \in [-1, ???
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)
{
//int i = std::min((int)uc - 1, m-2);
//int j = std::min((int)vc - 1, n-2);
#ifdef UNIFORM_CUBIC_C1_SPLINES
i = 2*((int)uc) - 1;
j = 2*((int)vc) - 1;
#else
i = (int)uc - 1;
j = (int)vc - 1;
#endif
s = uc - floor(uc);
t = vc - floor(vc);
// adjust for x or y on gridlines m and n (since impl. has 0 <= x <= m and 0 <= y <= n
#ifdef UNIFORM_CUBIC_C1_SPLINES
if (i == 2*m-1) {
i-=2;
s = 1;
}
if (j == 2*n-1) {
j-=2;
t = 1;
}
#else
if (i == m-1) {
i--;
s = 1;
}
if (j == n-1) {
j--;
t = 1;
}
#endif
}
inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2)
{
sum_w_ab2 = 0.0;
double Bs0 = B_0(s); double Bt0 = B_0(t);
double Bs1 = B_1(s); double Bt1 = B_1(t);
double Bs2 = B_2(s); double Bt2 = B_2(t);
double Bs3 = B_3(s); double Bt3 = B_3(t);
double tmp;
// unrolled by Odd Andersen 15. dec. 2003, for optimization
tmp = Bs0 * Bt0; w_kl[0][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt1; w_kl[0][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt2; w_kl[0][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt3; w_kl[0][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt0; w_kl[1][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt1; w_kl[1][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt2; w_kl[1][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt3; w_kl[1][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt0; w_kl[2][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt1; w_kl[2][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt2; w_kl[2][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt3; w_kl[2][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt0; w_kl[3][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt1; w_kl[3][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt2; w_kl[3][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt3; w_kl[3][3] = tmp; sum_w_ab2 += tmp * tmp;
// int k,l;
// sum_w_ab2 = 0.0;
// for (k = 0; k <= 3; k++) {
// for (l = 0; l <=3; l++) {
// double tmp = w(k, l, s, t);
// w_kl[k][l] = tmp;
// sum_w_ab2 += (tmp*tmp);
// }
// }
} |
I was not saying that you are wrong, I merely asked for a working prototype of your idea. Cheers, |
I will take a crack at it. I would not, however, be able to translate it to a VexCL format, but I think I can make it using OpenMP. |
An OpenMP implementation should be enough. Also, you don't need to implement the full BA algorithm. A parallelization of the loop on p. 4 is enough. |
Meanwhile, let me try this with OpenMP. |
Here is the timing when using the current version of mba_benchmark:
surf(0.5, 0.5, 0.5) = -0.000214423
Profile: 1105.916 sec.
The data size = 16 * 2048 * 1024 |
So it would be interesting to see if the setup time can be reduced. I am working on the OpenMP implementation, but it will be different from your C++11-based stuff. |
But that's the timing for VexCL's benchmark. I was more interested in timings for your own problem (I was secretly hoping that the setup takes a negligible fraction of time for your use case). |
My problem is a little bit more involved. First of all, a couple of questions: |
The grid has non-unit spacing, so some scaling is done before computing |
But suppose my grid coordinates are indeed integers; let's say, in my case, [0..15] x [0..2047] x [0..1023]. Then s[d], where d = 0, 1, ..., NDIM-1, would be [0, 1/16, 2/16, ..., 15/16] x [0, 1/2048, 2/2048, ..., 2047/2048] x [0, 1/1024, 2/1024, ..., 1023/1024]. |
And this is what I am not understanding. So, if my grid is uniform, how could one build those numbers for s[d]? I think that is the only problem I am facing now. |
Look again at the lines I quoted before. The first line scales the point coordinates to the domain of the control lattice. In case the grid has integer coordinates (note that this won't be the case on subsequent hierarchy levels anyway), the scaling works just the same.
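As a rough illustration (a hypothetical helper, following the BAalg code quoted earlier in this thread): a coordinate x in [xmin, xmax] is first scaled to the lattice domain, and only then split into a cell index and a local parameter, which is the s[d] in question.
#include <cmath>

// x is a data coordinate, [xmin, xmax] is the data range, m is the lattice size
// in that dimension. Integer grid coordinates are handled exactly the same way.
inline void map_to_lattice(double x, double xmin, double xmax, int m,
                           int &i, double &s)
{
    double u = (x - xmin) * m / (xmax - xmin); // scale to [0, m]
    i = static_cast<int>(u) - 1;               // index of the affected lattice cell
    s = u - std::floor(u);                     // local coordinate in [0, 1)
}
|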
Yes, I did see those lines. The problem is that the grid you provide in mba_benchmark is also in [0, 1] (when you call the grid creation module). So, is it mandatory for me to normalize the grid values to [0, 1]? Which is why I asked you: should I normalize my actual grid coordinates to lie in [0, 1] to conform with the mba architecture, or can it work with integer grid locations? |
The grid in the benchmark is just an example. |
One more question here: in your mba_benchmark, you assume that all points in the grid need to be interpolated, whereby all points in the grid form part of the lattice. Is this the mandatory way of implementing this? I think the number of control points should (generally) be much less than the actual number of points on the grid. |
There are no restrictions on the structure or number of control points or interpolation points. The benchmark is just a benchmark; it is there just to do a stress test of the algorithm. |
Note to myself: each control point updates |
I am facing a problem running vex::mba in parallel threads. The crash happens when I am doing Z = surf(C(0), C(1)). This means that I cannot run concurrent interpolations on the GPU from multiple host threads. In that case, if I were to serialize this, I would need to save the spline coefficients. However, the API that is provided for MBA does not allow me to do this. Is there a way out of this conundrum? |
Running interpolation on the same GPU from multiple threads does not make sense performance-wise, since OpenCL will serialize the kernels anyway. Moreover, it is not safe (see the notes about thread safety in the OpenCL specification), so you should guard these calls with a mutex. Applying an instance of MBA on a GPU other than the one it was constructed for is not supported. Regarding the incorrect results in 3D: I have mostly used mba for 3D problems, and it worked for me. Could you please provide a simple test case?
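For the guarding, something along these lines should be enough (a sketch; Z, surf, and C are the names from your snippet, the mutex is added here):
#include <mutex>

static std::mutex device_mutex; // one mutex per compute device

void interpolate_slice(/* ... */)
{
    std::lock_guard<std::mutex> lock(device_mutex);
    Z = surf(C(0), C(1)); // only one host thread issues device work at a time
}
|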
So, that's why I said I will serialize the interpolation operation. However, as the MBA algorithm runs on the CPU, I can potentially parallelize it. For that, I would need to create a container of vex::mba instances (one per slice). I will provide you with a set of 3D control points as well as test points. |
Only the constructor of MBA runs on the CPU. The interpolation itself runs on the GPU (or, more generally, on a compute device). You could use a vector of smart pointers (e.g. a boost::ptr_vector, or a std::vector of std::shared_ptr) to hold the instances. |
Maybe this should be documented somewhere in the code. |
What exactly? The use of smart pointers with a class that does not provide a default constructor? |
Yes, especially for this example. I am familiar with std::shared_ptr. I will have to look into boost::ptr_vector and see if that is more beneficial. |
I tried it with std::shared_ptr and it apparently worked. However, I think the vex::mba is being overwritten:
// This defines a context to be used by VexCL
static vex::Context ctx( vex::Filter::Env );
// A typedef for the routine to be called from vexcl
typedef vex::mba<2, double> mba_type;
// A vector of such mba modules
static std::vector< std::shared_ptr< mba_type > > mba_vector(cBands_Num, NULL);
and the calling loops are:
//#ifdef USE_OPENMP
//#pragma omp parallel for
//#endif
for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
{
....//The image to be interpolated pImage
// 4.) Run the MBA pipeline on this slice
MBA_BuildSplineLattice(pImage, h_Rows, w_Cols, sIdx);
}
// In this loop we use the spline lattices to interpolate missing pixels. This
// step is done serially as the interpolation is effected on the GPU.
for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
{
std::cout << "Interpolate: "<<cBands[sIdx] << std::endl;
// 5.) Run the MBA pipeline on this slice
MBA_Interpolate(pImage, h_Rows, w_Cols, sIdx);
.....//The interpolated image
}
and the place where I build the spline lattice:
....
//////////////////////////////////////////////////////////////////////////////
//From this point, we interface it with vexcl/mba.hpp
mba_vector[sIdx] = std::make_shared<mba_type>(ctx,
make_array2<double>(-0.01, -0.01),
make_array2<double>(1.01, 1.01),
p, v, make_array2<size_t>(2, 2));
I can compute the spline lattice for all slices in parallel and then do the interpolation serially. |
This looks mostly valid. Why do you make the context and the vector static? Also, what do you mean by 'vex::mba is being overwritten'? |
Static is because I do not want to get errors related to "multiple definitions". Now, I did it the other way, as follows:
for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
{
....//The image to be interpolated pImage
// 4.) Run the MBA pipeline on this slice
MBA_BuildSplineLattice(pImage, h_Rows, w_Cols, sIdx);
//}
// In this loop we use the spline lattices to interpolate missing pixels. This
// step is done serially as the interpolation is effected on the GPU.
// for (int sIdx = 0; sIdx < cBands_Num; ++sIdx)
//{
std::cout << "Interpolate: "<<cBands[sIdx] << std::endl;
// 5.) Run the MBA pipeline on this slice
MBA_Interpolate(pImage, h_Rows, w_Cols, sIdx);
.....//The interpolated image
}
and it works like a breeze! Can you see if you can reproduce this at your end? I think there is some issue with storing previously generated lattices. Yes, all the other factors that you have mentioned are stable. |
Ignore these comments... My mistake: I was not updating the slice to be interpolated, and it stayed on the last slice after the build-lattice loop :) |
Could you write a bit more clearly what does and what does not work? |
The variable "pImage" in the code above corresponds to an image slice. I was not updating that in the interpolation phase. Now, would / can there be a way to run interpolation for multiple spline lattices and multiple C(0) and C(1) rather than having to do it the serial way? |
As I said before, applying several instances of a vex::mba concurrently on the same compute device will not give you any performance benefit, since the kernels are serialized by OpenCL anyway. |
Currently in mba_benchmark.cpp, we see that the spline creation module runs on the CPU while the interpolation is done on the GPU. This makes sense. However, would it be possible to make this portion (i.e., the spline creation module) work on the GPU?