Support also doing processing on GPUs #349

Open · mohawk2 opened this issue Nov 14, 2021 · 26 comments

@mohawk2 (Member) commented Nov 14, 2021

This was discussed on pdl-devel back in Dec 2015 (see https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=201512). There is also some discussion from June 2021 on PerlMonks (https://www.perlmonks.org/?node_id=11134476).

I recently revisited this; my thinking so far is that the best platform for mixed CPU/GPU processing is OpenCL. However, OpenCL C doesn't support native complex arithmetic; it would be possible to use OpenCL C++ with a class that overloads the arithmetic operators, for source compatibility with existing Code. An easy first-cut alternative would be to support GPU use only for real types (perhaps further restricted to exclude double on pre-OpenCL 2.0 environments: GPUs have long been aimed at speed rather than extreme precision, because of graphics' needs).

Defining some terms: in GPGPU-land, a "kernel" is the code that actually runs on the GPU, typically run once per "work item"; that is essentially the "inner loop" inside a broadcastloop %{ /* blah */ %} (and/or loop). Current PDL operations have code generated by PP that doesn't distinguish between the "kernel" and the setup code (which runs on the "host"), since the output is just a bunch of C to put inside one function. When that distinction starts to matter, it will need some work.
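To make the correspondence concrete, here is a minimal sketch (the kernel name and buffer layout are hypothetical, not anything PP emits today) of what a Code body like $c() = $a() + $b(); would look like as an OpenCL C kernel, one work item per output element:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable /* double needs this extension on many devices */

/* equivalent of the broadcastloop body: $c() = $a() + $b(); */
__kernel void add_kernel(__global const double *a,
                         __global const double *b,
                         __global double *c)
{
    size_t i = get_global_id(0);  /* this work item's index */
    c[i] = a[i] + b[i];
}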

Things that would need doing to allow this:

  • split Code (etc) into kernel and setup code; another benefit of doing so would be to aid lazily-building ndarrays, since PDL could be made to set up a set of very small ndarrays, call the kernel on them, do something with the results, then throw them away - an example would be to sumover a very long sequence using essentially no memory
  • figure out a way to allow calling at least one other broadcastloop initiator (as called in PP-generated C code); for back-compatibility with the current CPU-only behaviour, it's not desirable to make PDL have a hard requirement of OpenCL to operate at all
  • implement dividing work into a task queue with 4x as many items as processing units, as alluded to here (https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=202109) (this will also help with parallelising non-deterministic tasks)
  • either figure out how to identify code that calls into GPU-unavailable C functions (e.g. LAPACK) and have that not use the GPU, or figure out the best way to manage opt-in/opt-out of GPU use - an easy way might be simply to try OpenCL-compiling the generated code, which will undoubtedly fail if any unsuitable function calls (or headers) are present (see the compile-probe sketch after this list)
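A sketch of that compile-probe idea, assuming an already-initialised OpenCL context and device (error handling reduced to the bare minimum):

#include <CL/cl.h>

/* Decide GPU eligibility by test-compiling the generated kernel source.
   Returns 1 if it compiles; source calling LAPACK etc. would fail here. */
int opencl_compilable(cl_context ctx, cl_device_id dev, const char *src)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS) return 0;
    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    clReleaseProgram(prog);
    return err == CL_SUCCESS;
}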

Resources:

@mohawk2 (Member, Author) commented Nov 15, 2021

An alternative approach that may help is OpenMP, which has support for "offloading" work to GPUs (and its API may be less low-level than OpenCL's).

One upside of this approach is that it may be as straightforward as annotating the generated code with OpenMP pragmas.
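For example (a sketch only; the function and its signature are hypothetical stand-ins for PP-generated code), offloading an elementwise loop could look like:

#include <math.h>
#include <stddef.h>

/* With OpenMP offloading support, the same annotated loop can run on a GPU;
   the map() clauses describe the host-to-device data movement. */
void elementwise_offload(const double *a, const double *b, const double *c,
                         double *d, size_t n)
{
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n], c[0:n]) map(from: d[0:n])
    for (size_t i = 0; i < n; i++)
        d[i] = a[i] + b[i] * sin(c[i]);
}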

Resources:

@mohawk2 (Member, Author) commented Nov 15, 2021

GitHub-hosted Actions runners don't have GPUs, but emulators seem to exist that would allow CI of GPU offloading; we aren't worried about performance there, only checking basic functionality.

Resources:

@karlglazebrook (Member) commented Nov 15, 2021 via email

@mohawk2 (Member, Author) commented Nov 15, 2021

PDL today (as of 2.059) on Macs will automatically use all available CPU cores using POSIX threads. That's not terrible already.

To use OpenMP on Mac, these instructions are apparently effective: https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave (alternative: https://mac.r-project.org/openmp/#do). While there isn't yet support for OpenMP offloading to GPU on Mac, this recent discussion suggests it may not be far away: https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w

Otherwise, given how relatively expensive Mac hardware is for the power you get, you might consider a PC gaming laptop (for powerful CPU and GPU) and run Linux on it ;-)

@karlglazebrook (Member) commented Nov 15, 2021 via email

@mohawk2 (Member, Author) commented Nov 15, 2021

This shows a proper analysis of M1 for science: https://github.com/neurolabusc/AppleSiliconForNeuroimaging - for me, a bit of a bombshell is that Metal doesn't do double precision, which seems to limit its value for scientific purposes.

My reading indicates you're 100% right on M1 being vastly better per watt. My question: is it better per pound sterling?

@karlglazebrook (Member) commented Nov 17, 2021 via email

@mohawk2 (Member, Author) commented Nov 17, 2021

> I will try and not get dragged further into an 'A is better than B' debate :-); I will just say my own experience in one year with an M1 MacBook Air has been entirely positive.

Glad to hear it!

> Re Metal and single precision: yes, if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW TensorFlow has been ported to Metal, which is an example of scientific acceleration in practice.

hashcat/hashcat#2976 indicates that Apple Silicon doesn't have double precision in hardware:

  Device Name                                     Apple M1
  [...]
  Double-precision Floating-point support         (n/a)

so there's no reason for Apple to make a Metal API for it. I think that implies that massively parallel double-precision work will be much faster on NVIDIA hardware, but unless the processing is highly CPU-intensive, the limiting factor will still be memory bandwidth. That also has implications for how much benefit there is in adding GPGPU support to PDL (possibly not much).

An alternative approach to increasing locality for more complex processing might be JIT operator-building. This would take the transformations being applied to ndarrays and combine them into a single, JIT-created operator that does all the steps in each "broadcastloop" while the data is still in cache, limiting how many round trips the data makes out to main RAM and back.
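In C terms, the gain is roughly the difference between these two shapes (a sketch; the function names are illustrative only):

#include <math.h>
#include <stddef.h>

/* pipeline of separate operators: each pass streams the whole
   array through main RAM, materialising an intermediate */
void pipeline(const double *a, const double *b, const double *c,
              double *out, double *tmp, size_t n)
{
    for (size_t i = 0; i < n; i++) tmp[i] = sin(c[i]);      /* pass 1 */
    for (size_t i = 0; i < n; i++) tmp[i] *= b[i];          /* pass 2 */
    for (size_t i = 0; i < n; i++) out[i] = a[i] + tmp[i];  /* pass 3 */
}

/* what the JIT-created operator would emit: one pass, no intermediate;
   each element stays in a register while its cache line is hot */
void fused(const double *a, const double *b, const double *c,
           double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i] * sin(c[i]);
}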

@karlglazebrook (Member) commented Nov 17, 2021 via email

@mohawk2 (Member, Author) commented Nov 17, 2021

> Device Name                                     Apple M1
> [...]
> Double-precision Floating-point support         (n/a)
>
> Well this is OpenCL reporting that, and Apple now deprecates OpenCL, so it may not be accurate. But you are right: until the situation settles down there is no benefit in trying to use GPUs to accelerate our array math.

Deprecating it doesn't mean they are not supporting it for some time to come. In particular, it in no way means the clinfo output is incorrect. Even Apple aren't quite that arrogant.

@mohawk2 (Member, Author) commented Nov 18, 2021

To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion".

Further thinking on that (as also discussed on IRC #native): PDL operations are referred to internally as "transformations". Tuomas's vision, as I infer it from docs and code, was that there would be function-type ones as well as slice-type ones (EDIT: in pdl.transtype there was also possibly a "retype" type, but the field was never used so it was removed this year), though only slice-type ones were implemented. To achieve this loop fusion, the code as supplied needs to be parsed (PP already does this, so the parse could just be stored somewhere), then the relevant loops sellotaped together based on knowledge of the desired output. That knowledge would come from making function-type transforms a real thing, which would be done by lazy-evaluating more, such that a not-yet-evaluated multi-stage transform could, on evaluation, be queried for its full shape, and then be sellotaped together into one big bespoke transform.

The Wikipedia page linked above says that the latest clang (12) and gcc (11.1) don't do redundant-allocation removal, and it seems to me that avoiding redundant allocation is very important. One way forward might be to use variable-length arrays (allocated on the stack), so the compiler knows the intermediate doesn't necessarily ever need to go to main RAM.
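A minimal sketch of that idea, assuming a block-wise reduction with a per-block intermediate (the block size and the operation are arbitrary choices for illustration):

#include <stddef.h>

/* sum of squares, computed block by block; the intermediate tmp[]
   is a stack-allocated VLA, so it need never touch main RAM */
double sum_of_squares(const double *in, size_t n)
{
    enum { BLK = 256 };  /* illustrative block size */
    double acc = 0;
    for (size_t off = 0; off < n; off += BLK) {
        size_t len = n - off < BLK ? n - off : (size_t)BLK;
        double tmp[len];                     /* C99 variable-length array */
        for (size_t i = 0; i < len; i++) tmp[i] = in[off+i] * in[off+i];
        for (size_t i = 0; i < len; i++) acc += tmp[i];
    }
    return acc;
}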

@mohawk2 (Member, Author) commented Nov 20, 2021

Actual measurement of performance with a simple loop fusion; code:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use PDL;
use Inline Pdlpp => 'DATA';

sub with_time (&) {
  my @t = gettimeofday();
  $_[0]->();
  printf "%g ms\n", tv_interval(\@t) * 1000;
}

$PDL::BIGPDL = $PDL::BIGPDL = 1;
my $N = $ARGV[0];
my ($a, $b, $c) = (ones($N), sequence($N), sequence($N));

print "with intermediates\n";
with_time { print +($a + $b * sin($c))->info, "\n" } for 1..5;
print "manual loop-fusion\n";
with_time { print PDL::a_plus_b_sin_c($a, $b, $c)->info, "\n" } for 1..5;

__DATA__
__Pdlpp__
pp_def('a_plus_b_sin_c',
  Pars => 'a(); b(); c(); [o]d()',
  GenericTypes => ['D'],
  Code => '$d() = $a() + $b() * sin($c());',
);

Results with $N of 60e6 (tuned to put real load on my setup, but not cause meaningful swapping), with the info removed:

with intermediates
1393.23 ms
1402.13 ms
1385.48 ms
1389.91 ms
1375.89 ms
manual loop-fusion
663.921 ms
646.778 ms
639.667 ms
641.872 ms
646.683 ms

In nice round numbers, the manually loop-fused operation above was a little over twice as fast as the naive version. This was also true with lower values of $N.

@mohawk2 (Member, Author) commented Dec 31, 2021

I am merging #354 into this.

In an email in Dec 2015, Chris wrote things that could be incorporated in future PDL*:

ArrayFire GPGPU parallel computing:
http://www.arrayfire.com/docs/index.htm
This could give PDLng a quick start for
high performance using GPUs.

COS, the C Object System:
https://github.com/CObjectSystem/COS/blob/master/doc/cos-draft-dls09.pdf
https://github.com/CObjectSystem/COS
This could give PDLng an object oriented
interface that maintains a C FFI so that
PDLng could call or be called from other
languages. Matplotlib anyone?

HDF5 Attributes:
http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html
I was thinking that our PDLng implementation
should cover the bases from FITS, HDF5, named
dimensions,.. Ingo has already done some
work along these lines.

@mohawk2 (Member, Author) commented Dec 31, 2021

Just been re-reading the COS docs. It is primarily about dynamic polymorphism, which I cannot see a great need for in any current or future PDL. The heart of PDL is writing type-generic Code (with possible Bad and/or Back analogues) that does the right thing on any supported PDL type.

A need for PDL that I am starting to see is first-class dimensions that could be referred to in Code, which would allow dimension-updates to be loop-fused (and setdims etc. to be first-class operations rather than underlying API). That would imply the RedoDims machinery also being first-class, for the same reason. Something that would help both RedoDims and BackCode is a more mathematical (and therefore potentially reversible) way to express transformations. That suggests a more declarative style, which seems antithetical to the imperative style of C.

I am open to the idea that COS could be part of the solution.

Named dimensions (HDF5 etc style) would be a beneficial thing generally, and an "einops" style of expressing things might be part of that reversible dimension-translating.

It seems unlikely, but not impossible, that ArrayFire will help us much. However, it does support farming processing out to OpenCL and CUDA, so it could be an alternative to OpenMP. What would be needed is an Alien::ArrayFire similar to Alien::OpenMP, then a generalisation of broadcastloop to make our current pthread-only implementation just one possibility. Given that ArrayFire's language is a bit different from C, that might be challenging, and OpenMP/OpenCL would probably be easier; however, the generalisation would need to happen in any case. Also needed would be to store the Code etc. in the pdl_transvtable, for access by OpenCL/ArrayFire/whatever.

@zmughal (Member) commented Jan 2, 2022

> To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion".

Another optimisation to think about, which isn't as well supported within PDL, is "loop tiling/blocking": https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current matmult implementation. There are also cache-oblivious algorithms and libraries that implement more performant versions of these operations for particular sizes and kinds of matrices, so loop tiling may not be as necessary as loop fusion.

@mohawk2 (Member, Author) commented Jan 2, 2022

> Another optimisation to think about, which isn't as well supported within PDL, is "loop tiling/blocking": https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current matmult implementation. There are also cache-oblivious algorithms and libraries that implement more performant versions of these operations for particular sizes and kinds of matrices, so loop tiling may not be as necessary as loop fusion.

I've read the matmult code in the past, and at the time (for me) it was quite eye-opening! This does bring in the idea of enhancing >1-dimensional loop() constructs to use tiling (probably with a check that the contained $x() lookups use no explicit index-lookups, to avoid out-of-order sequencing causing problems).

I'm also reading https://www.alcf.anl.gov/sites/default/files/2020-01/OpenMP45_Bertoni.pdf, which is a really good summary of OpenMP 4.5 constructs, especially the map material. It would be conceivable to use that, with a bit of inserted code to calculate tiles, inserting OpenMP #pragmas to achieve tiling.
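A sketch of what that might generate for a square matmult (the tile size, names, and exact pragma are illustrative assumptions, not anything PP produces):

#include <stddef.h>

#define TS 64  /* illustrative tile size */

/* tiled matrix multiply; assumes c[] is zeroed on entry. The map()
   clauses express the host/device data movement per OpenMP 4.5. */
void matmult_tiled(const double *a, const double *b, double *c, size_t n)
{
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: a[0:n*n], b[0:n*n]) map(tofrom: c[0:n*n])
    for (size_t ii = 0; ii < n; ii += TS)
      for (size_t jj = 0; jj < n; jj += TS)     /* each (ii,jj) tile of c */
        for (size_t kk = 0; kk < n; kk += TS)   /* is owned by one team   */
          for (size_t i = ii; i < ii+TS && i < n; i++)
            for (size_t j = jj; j < jj+TS && j < n; j++)
              for (size_t k = kk; k < kk+TS && k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}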

@mohawk2 (Member, Author) commented Jan 3, 2022

Thinking about how to make axisvalues a proper transformation, and the same for ndcoords. What might help with the need to loop over arbitrary numbers of dims would be a new multidimloop directive, something like the following (with any ndarray access benefiting from the loop "context", so it can just be $x() instead of spelling out the dimensions in some way):

multidimloop(PARENT, n) %{
  $CHILD() = n;
%}

This doesn't yet help with a proper reshape transformation, though as that is pure Perl (a clump(-1) then a setdims), it just needs a first-class setdims.

A more generalised multiloop is not yet clear to me but seems necessary.

@mohawk2 (Member, Author) commented Jan 4, 2022

#324 acts as a reminder, once this gets closer to completion, that PDL::Dataflow will want a thorough update, since it still has material in there about "families" and some quite complicated machinery that I don't think would be very useful.

@mohawk2 (Member, Author) commented Feb 19, 2022

Note to self: when (not if) we complete this, we'd update the PDL::PP docs to revise this:

> generate code that takes advantage of the vectorising/parallel computing features of your machine (this a project for the future)

@mohawk2 (Member, Author) commented Mar 29, 2022

Following on from a long-running branch on main PDL (index_ops), which has the bare start of a slightly wider implementation of "index operations" (extending the idea of indadd to multiplication, copying, etc.), and some discussion on pdl-porters (https://www.mail-archive.com/[email protected]/msg05754.html). That links to the High-Performance Fortran (HPF) 2.0 spec (http://dacnet.rice.edu/versions/hpf2/hpf-v20.pdf), which on pages 91-97 (PDF pages 101-107) discusses "scatter-combine" operations. Those can be considered equivalent to using indexND with plus etc.
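As a concrete picture of "scatter-combine" (a sketch; index bounds checking omitted):

#include <stddef.h>

/* HPF-style scatter-combine, equivalent in spirit to indadd or to
   indexND combined with plus: duplicate indices accumulate, which
   is what makes it a combine rather than a plain copy */
void scatter_add(double *out, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[idx[i]] += src[i];
}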

indexND is implemented using the range operation, which sort of cheats in PDL terms by not having any Pars at all. This is alluded to in the "Practical Magick" paper.

It occurs to me that indadd, the above "index operations", and indexND can be considered a generalisation of "broadcasting", which broadcasts the given operation over the full range of its broadcast-up inputs. One consequence of this is that read (and writeback, with two-way dataflow) routinely copies the whole broadcast range, which, as noted in the docs of range, is wasteful.

What we could have in addition is "narrowcasting". It would generate additional read (and if needed writeback) code, possibly in different functions with additional index-array parameters (which might or might not be proper ndarrays). This would parallel the "vaffine" form of transformation, which is a rectangular view onto another ndarray's actual data block.
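The generated "narrowcast" reader might look like this gather loop (a sketch; the index-array parameter is the hypothetical addition described above):

#include <stddef.h>

/* read only the elements the index array names,
   instead of copying the whole broadcast range */
void narrowcast_read(double *dst, const double *parent,
                     const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = parent[idx[i]];
}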

@mohawk2 (Member, Author) commented Mar 29, 2022

Another idea that occurred while pondering the above moves in the opposite direction: small and lightweight rather than increasingly magnificent abstractions. Currently, PDL operations (in particular the XS wrapper functions), if given non-broadcasting inputs, and/or single elements, and/or Perl scalars, do the full conversion of those to ndarrays, then set up a full pdl_trans, then call the full broadcasting apparatus.

This has been noted as being expensive (i.e. slow). An alternative would be for the XS wrappers to have some extra generated code that detects when the full machinery is not required, and interpolates a modified version of the fundamental code (or "kernel"). That could even allow just using IV/NV without further conversion, and even allow translating $x(n=>n1,m=>m1) into lookups into n-dimensional Perl arrays-of-arrays. One could think of this as making PDL function as a sort of Perl Language for operating on Data (not very catchy; probably needs a better name).
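A sketch of such a fast path (every name here is hypothetical; pdl_full_add stands in for the existing full machinery):

#include "EXTERN.h"
#include "perl.h"

double pdl_full_add(SV *a, SV *b);  /* hypothetical slow path */

/* detect the "no machinery needed" case: two plain Perl scalars
   get the interpolated kernel body directly, no ndarray conversion */
double add_fast_or_slow(SV *a, SV *b)
{
    if (!SvROK(a) && !SvROK(b))
        return SvNV(a) + SvNV(b);
    return pdl_full_add(a, b);  /* ndarrays: full broadcasting apparatus */
}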

This would allow PDL libraries to compete directly with e.g. https://metacpan.org/pod/Math::GSL

@mohawk2 (Member, Author) commented Apr 21, 2022

As noted in the updated PDL::Dataflow docs and on this PerlMonks thread, it would be interesting to shift PDL's threading model from "wait while all the threads return" to "start them off, then react to their completion" using an event loop.

@mohawk2 (Member, Author) commented May 22, 2022

Loop fusion note: which, index, where, etc. should be dealt with appropriately inline, rather than actually generating the index sets in a naive way. Observation from https://perlmonks.org/?node_id=11144070
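That is, instead of materialising the index set, test the condition inside the consuming loop (a sketch, using an arbitrary condition for illustration):

#include <stddef.h>

/* fused equivalent of summing $x->where($x > 0):
   the condition is tested inline, so no intermediate
   index ndarray is ever built */
double sum_where_positive(const double *x, size_t n)
{
    double acc = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] > 0)
            acc += x[i];
    return acc;
}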

@mohawk2 (Member, Author) commented Jun 7, 2022

Discussion of parsing Code chunks in order to support MPFR, etc, which might also facilitate translating into OpenCL: https://perlmonks.org/?node_id=11144472

@KJ7LNW commented Nov 1, 2022

This might be a discussion for PDL::LinearAlgebra, but FYI in case it is relevant:

"The NVBLAS Library is built on top of the cuBLAS Library using only the CUBLASXT API (refer to the CUBLASXT API section of the cuBLAS Documentation for more details). NVBLAS also requires the presence of a CPU BLAS library on the system. Currently NVBLAS intercepts only compute intensive BLAS Level-3 calls (see table below). Depending on the characteristics of those BLAS calls, NVBLAS will redirect the calls to the GPUs present in the system or to CPU. That decision is based on a simple heuristic that estimates if the BLAS call will execute for long enough to amortize the PCI transfers of the input and output data to the GPU."

[ https://docs.nvidia.com/cuda/nvblas/ ]

It currently supports these routines; all others are passed to the system BLAS:

  • gemm
  • syrk
  • herk
  • syr2k
  • her2k
  • trsm
  • trmm
  • symm
  • hemm
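The attraction is that no code changes are needed at the call site: an ordinary BLAS call like the sketch below is intercepted when NVBLAS is preloaded (e.g. via LD_PRELOAD=libnvblas.so; note NVBLAS intercepts the Fortran-interface dgemm, so whether a given CBLAS wrapper routes through it depends on the BLAS build):

#include <cblas.h>

/* square row-major matrix multiply: c = a * b */
void multiply(const double *a, const double *b, double *c, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
}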

@zmughal (Member) commented Dec 26, 2022
