************************** FFTW Threads **************************

This directory contains an extension to the FFTW library that allows
transforms in one or more dimensions to be parallelized on shared
memory machines supporting threads.

Currently, POSIX, Solaris, and BeOS threads are implemented and
tested.  (Code for Win32 and MacOS MP threads is also included.  Users
have reported that the Win32 code works, but we have not tested it
ourselves.)  If you have a shared-memory machine that uses a different
threads API, it should be a simple matter to add support for it; see
the file fftw_threads.h for more detail.

**** Installation: ****

Modify the Makefile to use the C compiler of your choice.  If you plan
to install the fftw_threads library and include files into some other
directory (using "make install"), modify the prefix variable.  You may
also have to modify the FFTW_THREADS or FFTW_THREADLIB variables to
indicate which thread library you want to use and where it is located
on your system.

Once the Makefile is set up to your satisfaction, typing "make" will
build the libfftw_threads.a library.  Typing "make install" will
install the library and header files into the locations specified by
the prefix variable, and typing "make tests" will build two programs,
test_threads and time_threads, for testing and benchmarking the
library.

Alternatively, on a non-Unix system, you can eschew the Makefile and
compile everything using whatever interface is typical for your
compiler.  Note that you will need to #define FFTW_THREADS to indicate
which threads library you are using (see the Makefile for examples).

**** Usage: ****

This documentation assumes that the reader is already familiar with
the usage of the normal FFTW library.  See the doc/ directory for
documentation of the standard FFTW.

To use the threads FFTW functions, you must first #include <fftw_threads.h>
in your program instead of #include <fftw.h>.

You should also link your program with -lfftw_threads -lfftw -lm, as
well as with any libraries that are required to use threads on your
system.

Before calling fftw_threads or fftwnd_threads (see below), you must
initialize the threads by calling fftw_threads_init exactly once:

fftw_threads_init();

Now, in your program, there are two new functions that you can call:
fftw_threads and fftwnd_threads.  These are exactly the same as fftw
and fftwnd, respectively, except that they take one extra parameter,
int nthreads, at the beginning of their parameter lists:

void fftw_threads(int nthreads,
                  fftw_plan plan, int howmany,
                  fftw_complex *in, int istride, int idist,
                  fftw_complex *out, int ostride, int odist);
void fftwnd_threads(int nthreads,
                    fftwnd_plan plan, int howmany,
                    fftw_complex *in, int istride, int idist,
                    fftw_complex *out, int ostride, int odist);

The nthreads parameter tells these functions to parallelize themselves
using nthreads concurrent threads of execution.  Thus, if you have 4
processors, passing 4 for nthreads will (ideally, for large
transforms) compute the transform 4 times as fast as the ordinary
fftw/fftwnd would have.  (You may use more threads than you have
processors, but this will probably slow down the transform.)
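
As a minimal sketch of the advice above, the following helper limits
a requested thread count to the number of processors.  The names
clamp_nthreads and online_processors are hypothetical and not part of
FFTW, and the _SC_NPROCESSORS_ONLN query is a common Unix extension
that is assumed, not guaranteed, to exist on your system.

```c
#include <assert.h>
#include <unistd.h>

/* Clamp a requested thread count to the range [1, nprocs].
   Precondition: nprocs >= 1.  (Hypothetical helper, not part of FFTW.) */
int clamp_nthreads(int requested, int nprocs)
{
    assert(nprocs >= 1);
    if (requested < 1)
        return 1;
    return requested > nprocs ? nprocs : requested;
}

/* Number of processors currently online, or 1 if it cannot be determined.
   ASSUMPTION: _SC_NPROCESSORS_ONLN is a widely available extension to
   sysconf(); where it is missing, this falls back to 1. */
int online_processors(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : (int) n;
#else
    return 1;
#endif
}
```

The result of clamp_nthreads(requested, online_processors()) could
then be passed as the nthreads argument.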

All the other parameters are exactly the same as for fftw/fftwnd.
(Note in particular that the *same plan* will work for both the
uniprocessor and the parallel transforms.)

Note: it is safe to call fftw_threads and fftwnd_threads in parallel
with themselves using the same plan (but different output arrays, of
course).  See also the section on thread-safety in the FFTW manual; it
is a good idea to use the FFTW_THREADSAFE flag when creating the plan.

**** What Is Parallelized and What Isn't: ****

If you call fftw_threads with howmany = 1, the one-dimensional
transform is parallelized.  [Note that you should not expect much
speedup, and may in fact see slower execution, unless your array size
is at least 2^13 or so.]

If you call fftw_threads with howmany > 1, the multiple
one-dimensional transforms are executed in parallel; the individual
one-dimensional transforms are not parallelized.  IMPORTANT: it is the
caller's responsibility to ensure that the output arrays do not
overlap with each other or with the inputs, lest race conditions
result.
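
One way to check part of this requirement before calling fftw_threads
is sketched below.  It assumes the usual FFTW layout, in which element
j of transform k lives at index k*odist + j*ostride, and verifies by
brute force that no two output transforms share an element.  The
function name outputs_are_disjoint is hypothetical, and the check does
not cover overlap between the output and input arrays.

```c
#include <assert.h>
#include <stdlib.h>

/* Return 1 if the howmany output transforms are pairwise disjoint,
   0 if two of them share an element, and -1 on invalid arguments or
   allocation failure.  Element j of transform k is assumed to be at
   index k*odist + j*ostride.  (Hypothetical helper, not part of FFTW.) */
int outputs_are_disjoint(int n, int howmany, int ostride, int odist)
{
    size_t extent, idx;
    unsigned char *seen;
    int j, k, disjoint = 1;

    if (n < 1 || howmany < 1 || ostride < 1 || odist < 0)
        return -1;

    /* One past the largest index touched by any transform. */
    extent = (size_t) (howmany - 1) * odist + (size_t) (n - 1) * ostride + 1;
    seen = calloc(extent, 1);
    if (seen == NULL)
        return -1;

    /* Mark every index; hitting a marked index means two transforms
       collide, since indices within one transform are distinct. */
    for (k = 0; k < howmany && disjoint; k++) {
        for (j = 0; j < n; j++) {
            idx = (size_t) k * odist + (size_t) j * ostride;
            assert(idx < extent);
            if (seen[idx]) {
                disjoint = 0;
                break;
            }
            seen[idx] = 1;
        }
    }

    free(seen);
    return disjoint;
}
```

For example, three transforms of length 4 stored back to back
(ostride = 1, odist = 4) or interleaved (ostride = 3, odist = 1) pass
the check, while ostride = 1 with odist = 2 does not.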

If you call fftwnd_threads with howmany = 1, the multi-dimensional
transform is parallelized.

If you call fftwnd_threads with howmany > 1, the multiple
multi-dimensional transforms are executed in *sequence,* with each
individual transform being parallelized.  This is non-optimal, and it
would be fairly easy to make the individual transforms occur in
parallel, but we haven't implemented that feature yet.  If this
capability is important to you, please let us know.

**** The Test Programs: ****

Typing "make tests" will make two programs, test_threads and
time_threads, which test the correctness of fftw_threads and benchmark
it against the uniprocessor version, respectively.

Usage: 
	test_threads nthreads
	time_threads nthreads

test_threads compares the output of the fftw_threads library with the
output of the fftw library for arrays with random sizes and ranks from
1 to 5.  The threads code is executed using nthreads concurrent
threads.  You should probably run this program with an nthreads of 1
and 2, at least, to ensure that things are working.  If your machine
runs out of memory when running the tests, you should modify
test_threads.c to use smaller arrays.  You do this by changing the
SIZE_MULT #define to a smaller number (you can also change it to a
bigger number if you want to test bigger problems and have enough
memory). 

time_threads benchmarks the fftw_threads library against the
uniprocessor fftw library for both one- and three-dimensional arrays
of a variety of sizes.  This is useful to see what kind of speedup you
are getting from using multiple processors.  The nthreads parameter
tells the program how many parallel threads to use.  If you have a
parallel machine, we would be interested to see the results of this
benchmark on your machine (email us at fftw@theory.lcs.mit.edu).
(Incidentally, each time printed by this program is the time in
microseconds for a single transform divided by N lg N, where lg is the
base-2 logarithm.)

WARNING: time_threads uses the same timing routine as fftw (see the
section on "Using high-resolution clocks" in the documentation).  This
defaults to clock().  However, on many systems, clock() is not the
same as the wall-clock time, and may return very deceptive results for
multi-threaded programs.  If this is a problem on your system, you may
want to replace references to fftw_get_time() in time_threads.c with
whatever timing routine you deem appropriate.  (Note: on Solaris
machines, if you compile with -DSOLARIS, fftw will use a high
resolution timer that produces correct results for multithreaded
programs.)
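
As a rough sketch of a replacement timing routine, the code below
reads wall-clock time with the POSIX gettimeofday call and, for
comparison, processor time with clock().  The names wall_seconds and
cpu_seconds are hypothetical; how such a routine is wired into
time_threads.c in place of fftw_get_time() is left to the reader, and
gettimeofday is assumed to be available on your system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock time in seconds, read with POSIX gettimeofday().
   (Hypothetical helper, not part of FFTW.) */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);

    assert(rc == 0);
    (void) rc; /* silence the unused warning when NDEBUG is defined */
    return (double) tv.tv_sec + (double) tv.tv_usec * 1e-6;
}

/* Processor time in seconds as reported by clock().  On many systems
   this counts CPU time consumed by the whole process, so it is not a
   measure of elapsed time for a multi-threaded run. */
double cpu_seconds(void)
{
    return (double) clock() / CLOCKS_PER_SEC;
}
```

Taking the difference of two wall_seconds() readings around a
transform gives the elapsed time; comparing it with the difference in
cpu_seconds() shows whether clock() is deceptive on your system.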
