Prefer scipy.fft to numpy.fft #211

Open
mreineck opened this issue Jul 22, 2020 · 9 comments
Labels: enhancement (New feature or request)

Comments

@mreineck

I just noticed that the code uses numpy.fft in many places.

If you have the choice, I strongly suggest using scipy.fft instead, because it is much faster for multi-dimensional transforms. Also, the interface is practically identical (in contrast to the older scipy.fftpack module).
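For typical calls the change is a drop-in replacement; a minimal sketch (array shape arbitrary), with `workers` being the main scipy-specific addition:

```python
import numpy as np
import scipy.fft

x = np.random.standard_normal((512, 512)) + 1j * np.random.standard_normal((512, 512))

X_np = np.fft.fftn(x)                 # single-threaded numpy.fft
X_sp = scipy.fft.fftn(x, workers=4)   # pypocketfft backend, 4 threads

# both backends agree to numerical precision
np.testing.assert_allclose(X_np, X_sp, atol=1e-8)
```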

@sjperkins
Member

Thanks for raising this @mreineck.

As I understand it, scipy.fft is pypocketfft? In that case, we'd certainly want to leverage the performance enhancements.

@landmanbester do you think this is something you could address in #204, or do you think that would be outside the scope of the PR? Of course, since you're introducing a dependency on ducc0, you could use pypocketfft within the gridding functionality.

It seems that scipy.fft could be used in the dask wrappers:

https://github.com/ska-sa/codex-africanus/search?q=fft&unscoped_q=fft

scipy is an optional codex dependency. I wonder whether we should force the user to install scipy in order to use the nifty gridding functionality, or silently fall back to numpy if scipy is not installed. Anyone have strong opinions here?
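A minimal sketch of the silent-fallback option, assuming the functions are re-exported from a small shim module (names here are illustrative, not the actual codex layout):

```python
# hypothetical africanus-internal shim: prefer scipy.fft, fall back to numpy.fft
try:
    from scipy.fft import fftn, ifftn, fftshift, ifftshift
except ImportError:
    from numpy.fft import fftn, ifftn, fftshift, ifftshift
```

The gridding code would then import from the shim rather than from numpy.fft directly.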

@mreineck
Author

As I understand it, scipy.fft is pypocketfft?

Yes and no :) It's pypocketfft, but since scipy wheels are built without CPU-specific optimizations, it will be somewhat slower than a self-compiled pypocketfft or ducc0.fft package.

To make things even more confusing, numpy.fft is also pocketfft, but based on an old C version. As far as I understand, numpy can't upgrade (at least not for the moment), since they don't accept extensions written in C++.

It seems that scipy fft could be used in the dask wrappers:

In fact I opened another issue earlier today for dask, suggesting that they switch from numpy.fft to scipy.fft internally as well ;-)

@mreineck
Author

In the case that ducc0 becomes a mandatory dependency (e.g. when the w-gridder is included), the best option is probably to use ducc0.fft throughout the code.
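Roughly, a numpy.fft.fftn / ifftn pair would then become something like the following (ducc0.fft.c2c calling convention, with inorm assumed to follow the usual pocketfft normalisation flags):

```python
import numpy as np
import ducc0

x = np.random.standard_normal((1024, 1024)) + 0j

# forward multi-D transform, multi-threaded; no normalisation (like np.fft.fftn)
X = ducc0.fft.c2c(x, axes=(0, 1), forward=True, nthreads=4)

# inverse transform; inorm=2 divides by the product of the transformed axis
# lengths (like np.fft.ifftn)
x_back = ducc0.fft.c2c(X, axes=(0, 1), forward=False, inorm=2, nthreads=4)
```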

@landmanbester
Collaborator

Yep, I could do this in that PR. Maybe we should consider dask wrappers for the FFTs in ducc0? This would be the first step towards recognising that, eventually, we are going to need a distributed FFT. @mreineck do you already have something like that in NIFTy?
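For the unchunked axis this could be as simple as mapping the multi-threaded transform over the blocks; a hedged sketch assuming ducc0.fft.c2c's calling convention (transforming the chunked axis would additionally need the rechunk/transpose discussed further down):

```python
import dask.array as da
import numpy as np
import ducc0

# chunked along axis 0 only, so each block spans the full extent of axis 1
x = (da.random.random((8192, 8192), chunks=(1024, 8192))
     + 1j * da.random.random((8192, 8192), chunks=(1024, 8192)))

# a per-block transform along the unchunked axis equals the global transform
X = x.map_blocks(ducc0.fft.c2c, axes=(1,), forward=True, nthreads=4,
                 dtype=np.complex128)
```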

@mreineck
Author

@mreineck do you already have something like that in NIFTy?

You mean an FFT for arrays that are distributed over several tasks along one axis?
I have Python code for doing the necessary redistribution; this was originally in NIFTy, but we scrapped this mode of parallelization. I should be able to find it easily in the old commits, though.

Sooner or later I will need MPI array transposition code in ducc as well, but that will probably not happen very soon.

@landmanbester
Collaborator

I meant any out-of-memory 2D FFT. I guess splitting up tasks along one of the spatial axes is one way to go. We saw a talk by someone at Cambridge who was trying to do this, and he made it look pretty difficult. Philipp told me that it was in principle in NIFTy already, just not the IO part of it. I am mainly thinking about this because of the promised 60k x 60k SKA images, but I know some wide-field VLBI people who would already find this very useful.

@mreineck
Author

I guess splitting up tasks along one of the spatial axes is one way to go.

It's what FFTW does, and it seems to work reasonably well. Of course, a full FFT requires two full array transpositions, which is not cheap compared to the pure FFT computation cost.

As long as the array fits into a single node's memory, I'd definitely prefer a multi-threaded transform over MPI (and perhaps do simultaneous independent FFTs on multiple nodes, if that's possible) ... but 60kx60k is getting close to the limit.

Assuming I had to work with a huge set of visibilities and such a large image, I'd try the following approach with ducc.wgridder:

  • determine the extent of the w range
  • assign equal chunks of the w range to my MPI tasks (1 MPI task per compute node)
  • put every visibility onto all nodes it can affect (this can be more than one node because it is gridded onto several w planes)
  • run ms2dirty independently on every task
  • allreduce_sum the result

This won't scale perfectly, but on the other hand it's really simple to set up while still being quite efficient.
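In skeleton form it might look like this (grid_chunk and w_support are stand-ins for the actual ducc0.wgridder.ms2dirty call and the w-overlap bookkeeping; uvw and vis are assumed to be readable per task):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, ntasks = comm.Get_rank(), comm.Get_size()

# 1. global extent of the w range
w = uvw[:, 2]
w_lo = comm.allreduce(w.min(), op=MPI.MIN)
w_hi = comm.allreduce(w.max(), op=MPI.MAX)

# 2. equal w chunks, one per MPI task (1 task per compute node)
edges = np.linspace(w_lo, w_hi, ntasks + 1)

# 3. every visibility that can touch this task's w planes (the kernel has
#    finite support in w, hence the +/- w_support padding)
mine = (w >= edges[rank] - w_support) & (w <= edges[rank + 1] + w_support)

# 4. grid the local subset independently on every task
dirty_local = grid_chunk(uvw[mine], vis[mine])

# 5. allreduce_sum of the partial dirty images
dirty = np.empty_like(dirty_local)
comm.Allreduce(dirty_local, dirty, op=MPI.SUM)
```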

In any case, the only missing ingredient for the kind of MPI-distributed FFT you are looking for is Python functionality that can transpose an MPI-distributed 2D array. I'm pretty sure this must exist already...

@sjperkins
Member

It's what FFTW does, and it seems to work reasonably well. Of course, a full FFT requires two full array transpositions, which is not cheap compared to the pure FFT computation cost.

As long as the array fits into a single node's memory, I'd definitely prefer a multi-threaded transform over MPI (and perhaps do simultaneous independent FFTs on multiple nodes, if that's possible) ... but 60kx60k is getting close to the limit.

As I understand the problem, the challenge of a distributed FFT is handling the communication pattern inherent in the butterfly diagram.

At some point one needs to broadcast all images to all nodes and it just becomes really expensive from a data transfer POV?

@mreineck
Author

As long as your 1D FFTs fit into memory, you shouldn't need to worry about the butterfly.

The normal strategy is:

  • START: 2D array is distributed over n tasks along axis 0
  • do the FFT along axis 1 (can be done completely without communication) for the partial array on every task
  • transpose the array (this requires a fairly nasty all2allv communication in the general case)
  • do the FFT along axis 1 (which was axis 0 originally) for the partial array on every task
  • the FFT is now complete
  • (optional) do another transpose to re-establish the original storage scheme

It's not terribly complicated, but it is indeed expensive.
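A minimal, purely illustrative mpi4py + scipy.fft sketch of exactly that strategy (square N x N array, ntasks dividing N):

```python
from mpi4py import MPI
import numpy as np
import scipy.fft

comm = MPI.COMM_WORLD
rank, ntasks = comm.Get_rank(), comm.Get_size()

N = 4096                 # global array is N x N
m = N // ntasks          # rows owned by each task (assumed to divide evenly)

# START: each task owns a contiguous slab of m rows (distributed along axis 0)
local = (np.random.standard_normal((m, N))
         + 1j * np.random.standard_normal((m, N)))

# FFT along axis 1: purely local, no communication
local = scipy.fft.fft(local, axis=1)

# transpose: cut the slab into ntasks (m x m) tiles along axis 1, send tile s
# to task s, then glue the received tiles back together transposed
send = np.ascontiguousarray(local.reshape(m, ntasks, m).transpose(1, 0, 2))
recv = np.empty_like(send)
comm.Alltoall(send, recv)
local = np.concatenate([tile.T for tile in recv], axis=1)   # shape (m, N)

# FFT along axis 1 again (this is axis 0 of the original array)
local = scipy.fft.fft(local, axis=1)

# the 2D FFT is now complete, but stored in transposed order; an optional
# second Alltoall of the same form re-establishes the original storage scheme
```

With an uneven distribution the plain Alltoall becomes the all2allv case mentioned above.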
