************************* Documentation for MPI FFTW *************************

SECTION 1. Introduction

This directory contains a parallel multi-dimensional FFT based on MPI
for use with the FFTW library (http://theory.lcs.mit.edu/~fftw).

(See the FFTW documentation for usage restrictions and licensing
information.)

The fftwnd_mpi subroutine lets you compute an FFT of data that is
distributed over the processes in a specific MPI communicator (typically
MPI_COMM_WORLD, to use all processes).  The data will be distributed
across the processes with respect to the rows (first dimension) of the
data.  (The data distribution will be described in more detail below.)

For efficiency, fftwnd_mpi gives the caller the option of having the
transformed data returned in a transposed order (this saves the
communications cost of an extra transpose of the data).

The subdirectory transpose_mpi in this directory contains in-place
transpose routines for MPI that are used by fftwnd_mpi.  These
routines can also be used completely independently of FFTW by anyone
needing an in-place transpose routine for MPI.

IMPORTANT: The transpose_mpi routines will not work "as is" if you
have changed fftw_real to something other than double.  If you have
changed fftw_real, then you must change the transpose element type
definitions in transpose_mpi/transpose_mpi.h!

SECTION 2. Using fftwnd_mpi

Users of fftwnd_mpi should be familiar with the usage of FFTW.  There
are, however, a few differences between fftwnd and fftwnd_mpi; these
will be noted below.  All of the steps described below must be
performed on all processes involved in the FFT computation.

SECTION 2.1: Header files

To use fftwnd_mpi, first include the following header files:

#include <mpi.h>
#include <fftwnd_mpi.h>

SECTION 2.2: Initializing MPI

Second, you have to remember to initialize MPI at the beginning of
your program, and "finalize" it at the end of your program:

int main(int argc, char **argv)
{
     MPI_Init(&argc, &argv);
     ...
     MPI_Finalize();
     return 0;
}

SECTION 2.3: Plan creation

Next, you will create a plan (of type fftwnd_mpi_plan), just as for
the normal FFTW, for computing the transform.  This is done by calling
the fftwnd_mpi_create_plan subroutine:

extern fftwnd_mpi_plan fftwnd_mpi_create_plan(MPI_Comm comm,
                                              int rank, const int *n,
                                              fftw_direction dir,
                                              int flags);

The arguments to this function are the same as those for
fftwnd_create_plan, with two exceptions.  First, there is the comm
argument, which specifies the set of processors that will be
participating in the computation (typically, MPI_COMM_WORLD to use all
processes).  Second, since all fftwnd_mpi transforms are in place, the
FFTW_IN_PLACE flag is ignored if you include it in flags.  Note that
the n[] array here gives the total size of the array to be transformed
(as opposed to the size of the data local to the current process).

The rank of the array must be greater than 1.

Just as for fftwnd_create_plan, there are two alternate interfaces for
convenience in performing 2D and 3D transforms:

extern fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
                                              int nx, int ny,
                                              fftw_direction dir, int flags);
extern fftwnd_mpi_plan fftw3d_mpi_create_plan(MPI_Comm comm,
                                              int nx, int ny, int nz,
                                              fftw_direction dir, int flags);

A plan can be reused for many transforms of the same dimensions.

NOTE: The transform is computed most efficiently when the first two
dimensions of the data are divisible by the number of processes.
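For example, creating (and later destroying) a plan for a forward 3D
transform over all processes might look like the following sketch; the
32 x 32 x 32 dimensions and the FFTW_ESTIMATE flag are arbitrary
illustrative choices:

```c
#include <mpi.h>
#include <fftwnd_mpi.h>

int main(int argc, char **argv)
{
     fftwnd_mpi_plan plan;

     MPI_Init(&argc, &argv);

     /* Plan a forward 32 x 32 x 32 transform over all processes.
        (The dimensions and the FFTW_ESTIMATE flag are arbitrary
        choices for illustration.) */
     plan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, 32, 32, 32,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

     /* ... set up local data and compute transforms here ... */

     fftwnd_mpi_destroy_plan(plan);
     MPI_Finalize();
     return 0;
}
```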

SECTION 2.4: Plan destruction

When you are done with a plan, deallocate it using:

extern void fftwnd_mpi_destroy_plan(fftwnd_mpi_plan p);

SECTION 2.5: Data format

Here, we describe the format of the data that will be transformed.
This is probably the most complicated part of using fftwnd_mpi, so pay
close attention.

The data that the FFT is performed on is distributed across the
processors. This data must be distributed with respect to the rows
(first dimension, or "x" dimension) of the multi-dimensional array in
a specific way.  A process can find out what portion of the array it
is expected to store by calling the fftwnd_mpi_local_sizes function:

extern void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                                   int *local_nx,
                                   int *local_x_start,
                                   int *local_ny_after_transpose,
                                   int *local_y_start_after_transpose,
                                   int *total_local_size);

You pass this function a plan, and it sets the remaining five
parameters, which describe the local dimensions of the array as follows:

Suppose that the total array has dimensions n1 x n2 x n3 x ... x nd.
The current process is then expected to store the following subset of
the total array:

[local_x_start..local_x_start+local_nx-1] x n2 x n3 x ... x nd

For example, if you had a 100x200 matrix, local_nx = 10, and
local_x_start = 60, then it means that the local process stores 10
rows of the matrix (each with 200 columns), starting at row 60
(i.e. rows 60-69, inclusive).

The local data is expected to be stored in row-major order (C order).

As will be described below, fftwnd_mpi gives the caller the option of
leaving the transformed data in a transposed order.  The next two
parameters returned by fftwnd_mpi_local_sizes describe the amount of
the transposed array that would then be stored in the current process.

fftwnd_mpi transposes the first two dimensions of the data; in
the above example, the transformed data would then have dimensions
n2 x n1 x n3 x ... x nd.  This transposed data is distributed with
respect to the first dimension of the data, now the n2 or "y"
dimension.  The current process will hold a slice of the array
starting at local_y_start_after_transpose and consisting of
local_ny_after_transpose rows of the transposed array.

The final parameter returned by fftwnd_mpi_local_sizes,
total_local_size, tells the caller the number of data points that will
be stored on the local process.  Thus, when you allocate the data
local to the current process, you must create an array of size
total_local_size multiplied by the size of each data point in bytes
(i.e. sizeof(fftw_complex)).

IMPORTANT: in certain cases (where the number of processes does not
divide the dimensions of the array), total_local_size will be zero on
some processes.  This is okay!  However, you should probably check for
this condition to avoid disposing of NULL pointers, etc.  Note that
you still must call fftwnd_mpi from processes that have zero
total_local_size (in order to make sure that MPI_Barrier calls do not
deadlock).
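Putting this together, a typical allocation sequence looks something
like the sketch below.  (The allocate_local_data helper is a
hypothetical name, not part of the fftwnd_mpi API; error checking of
malloc is omitted for brevity.)

```c
#include <stdlib.h>
#include <mpi.h>
#include <fftwnd_mpi.h>

/* Allocate the local portion of the array for a given plan.
   Returns NULL if this process holds no data, which can legitimately
   happen when the dimensions are not divisible by the number of
   processes. */
fftw_complex *allocate_local_data(fftwnd_mpi_plan p)
{
     int local_nx, local_x_start;
     int local_ny_after_transpose, local_y_start_after_transpose;
     int total_local_size;

     fftwnd_mpi_local_sizes(p, &local_nx, &local_x_start,
                            &local_ny_after_transpose,
                            &local_y_start_after_transpose,
                            &total_local_size);

     if (total_local_size == 0)
          return NULL;  /* okay -- but still call fftwnd_mpi later! */

     return (fftw_complex *) malloc(total_local_size
                                    * sizeof(fftw_complex));
}
```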

SECTION 2.6: Computing the transform

Once you have created the plan and set up the local data as described
above, computing the transform is simple.  You just call the following
function, fftwnd_mpi, on all processes in the MPI_Comm communicator
passed to fftwnd_mpi_create_plan:

extern void fftwnd_mpi(fftwnd_mpi_plan p,
                       int n_fields, fftw_complex *local_data,
                       fftwnd_mpi_output_order output_order);

This function has a different interface than fftw or fftwnd.  There
are no howmany, stride, or dist parameters--these features are not
supported by the MPI transform.  A limited "howmany" feature is
supported through the n_fields parameter, described below.  (Most
users will simply pass 1 for n_fields.)

p is a plan for computing the transform.  It contains the dimensions
of the transform and all other required data (see section 2.3).

local_data is a pointer to the data local to the current process, as
described in section 2.5.  The transform is computed in-place, so upon
return local_data will point to the transformed data (the original
data is destroyed).

As mentioned earlier, fftwnd_mpi gives the caller the option of
returning the transformed data in transposed order.  This option is
specified through the output_order parameter, which can take on one of
two values:

FFTW_NORMAL_ORDER: Transform is returned in the same order as the
input.  The slice of the transformed array pointed to by local_data is
the same as it was for the input array, in the same format.

FFTW_TRANSPOSED_ORDER: Transform is returned in transposed order (the
first two dimensions of the data are switched).  The slice of the
output data pointed to by local_data is described in section 2.5.

Users are urged to consider using FFTW_TRANSPOSED_ORDER, if possible,
since it achieves considerably better performance than
FFTW_NORMAL_ORDER.  (Essentially, fftwnd_mpi only has to perform
one transpose operation instead of two.)
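As an illustration, the steps above combine into a short complete
program such as the following sketch.  (The 64 x 64 dimensions and the
FFTW_ESTIMATE flag are arbitrary choices; the data is simply zeroed
here as a placeholder for real initialization.)

```c
#include <stdlib.h>
#include <mpi.h>
#include <fftwnd_mpi.h>

int main(int argc, char **argv)
{
     fftwnd_mpi_plan plan;
     fftw_complex *local_data;
     int local_nx, local_x_start, local_ny, local_y_start;
     int total_local_size, x, y, n2 = 64;

     MPI_Init(&argc, &argv);

     plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, 64, n2,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

     fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                            &local_ny, &local_y_start,
                            &total_local_size);

     local_data = (fftw_complex *) malloc(total_local_size
                                          * sizeof(fftw_complex));

     /* Initialize the local rows, stored in row-major order; local
        row x corresponds to global row local_x_start + x. */
     for (x = 0; x < local_nx; ++x)
          for (y = 0; y < n2; ++y) {
               local_data[x * n2 + y].re = 0.0;
               local_data[x * n2 + y].im = 0.0;
          }

     /* Transform in place, leaving the output in transposed order. */
     fftwnd_mpi(plan, 1, local_data, FFTW_TRANSPOSED_ORDER);

     free(local_data);
     fftwnd_mpi_destroy_plan(plan);
     MPI_Finalize();
     return 0;
}
```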

* Note: If you compute the inverse transform on FFTW_TRANSPOSED_ORDER
output, the plan for the inverse transform should be created using
the transposed array dimensions (i.e. n2 x n1 x n3 x n4 x ...).
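For instance, a 2D inverse plan matching a forward nx x ny transform
whose output was left in FFTW_TRANSPOSED_ORDER would be created with
the dimensions swapped.  (The create_inverse_plan helper below is a
hypothetical name used only for illustration.)

```c
#include <mpi.h>
#include <fftwnd_mpi.h>

/* Given the dimensions nx x ny of a forward transform whose output
   was left in FFTW_TRANSPOSED_ORDER (and is thus distributed as
   ny x nx), create the matching inverse plan with the first two
   dimensions swapped. */
fftwnd_mpi_plan create_inverse_plan(MPI_Comm comm, int nx, int ny,
                                    int flags)
{
     return fftw2d_mpi_create_plan(comm, ny, nx, FFTW_BACKWARD, flags);
}
```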

The n_fields parameter allows multiple transforms to be computed at
the same time with greater efficiency than if they were computed
separately.  Here, each data point, rather than being a single
fftw_complex value, is an array (or a struct) of n_fields
fftw_complex values.  The FFT of each of these n_fields "fields" is
computed at the same time.

To compute just one transform, with a normal array, use n_fields = 1.

On a uniprocessor, the use of n_fields with fftwnd_mpi is equivalent
to passing howmany=n_fields, idist=1, and odist=1 to fftwnd.

Note that local_data should point to an array of size (in bytes) equal
to total_local_size * n_fields * sizeof(fftw_complex).

NOTE: This transform is computed most efficiently when the first two
dimensions of the data are divisible by the number of processes.

SECTION 2.7: Compiling and linking your code

Your code should be linked with -lm and -lfftw (the uniprocessor FFTW
library). It must also be linked with fftwnd_mpi.o,
transpose_mpi/transpose_mpi.o, and transpose_mpi/TOMS_transpose.o.

These files should all be compiled however you normally compile C
programs for MPI on your system.

Also, you need to set the include path for your compiler so that it
can "see" fftwnd_mpi.h and transpose_mpi/transpose_mpi.h (with the -I
compiler option).

For an example, see the test programs (described below).

SECTION 3: Test programs.

There are two test programs for testing the correctness and speed of
fftwnd_mpi, respectively.  They are called test_fftwnd_mpi and
time_fftwnd_mpi, require no command-line options, and can be compiled
by typing make in the fftwnd_mpi directory.

To compile them, however, you must first have compiled the
transpose_mpi code by typing make in the transpose_mpi directory (see
the transpose_mpi/README file).

You will, of course, have to modify the Makefiles to reflect the
location of various libraries on your system and whatever options you
use to compile MPI programs.

SECTION 4: Algorithm

Briefly, fftwnd_mpi works by computing the FFT of the local data,
transposing the data, FFTing the remaining dimension, and (optionally)
transposing back.

The most complicated part of this process is the transpose, which is
performed in-place.  The transpose was designed under the assumption
that inter-process communication would be very expensive, and thus it
performs a number of permutations on the data in order to exchange a
minimal number of large blocks between the processes.  On some
architectures with very efficient communications, this is not
optimal--it would be better to perform fewer permutations on the data
and have more communications.  We are considering how to address the
problem of optimizing the balance between communications and local
permutations for future versions of fftwnd_mpi. 

SECTION 5. Feedback

You can contact the authors of this program, Matteo Frigo and Steven
G. Johnson, at fftw@theory.lcs.mit.edu.  Don't hesitate to drop us a
line if you have any questions, comments, or suggestions.
