GUPS
----

This is code to evaluate an architecture with respect to the NSA
Giga-Updates per Second benchmark. It was written by Brian R. Gaeke
at UC Berkeley, and is not endorsed by NSA.

Hints for compiling
-------------------

This is a research code, and a benchmark, so it's not set up to have
sane software configuration. Hack the Makefile, header files, and code
until they work well for your system. Feel free to mail me any hints
you have and I will archive them on the Web for people to see.
Remember the AT&T Unix support policy: as-is, no advertising, no
support, no bug fixes, payment in advance.

You will need GNU Make in order to use the Makefile, or you can write
your own.

It is useful to be able to access very large pointers. On SGI, you
need the -64 flag for this. Modify gups_types.h to typedef uintN as
an N-bit type, for all N in {8, 16, 32, 64}, and the program should
get the rest straight.

You will need the exp() math call. Link with -lm (the standard math
library) for this. (Don't worry, it's only called from the startup
code.)

Versions
--------

gups_indexserial and gups_serial are both derived from the code in
gups_serial.c, which fetches, modifies, and updates one data element
after another, serially. gups_indexserial computes an array of indices
first, whereas gups_serial computes them on the fly.

gups_parallel is a version, intended for the Cray C compiler, that has
been optimized for easy vectorization. It is a data-parallel version
of the gups_serial code: if the compiler is willing, it will fetch
whole random vectors at once using indexed loads, modify them all by
adding one to each element, and then store them all back again at once
with an indexed store.

gups_skeleton is a copy of gups_parallel with the actual C code for
the gups routines removed, so that you can easily replace them with
hand-optimized assembly routines (in separate object files, that is).

annotated_asm is unoptimized assembly code for the GUPS routines for
the VIRAM-1, with comments added. optimized_asm is the same thing, but
hand-optimized to remove some unnecessary code and to account for
assumptions we made that the compiler couldn't prove.

Description of the input file
-----------------------------

The input file contains three numbers, in free-form format: first, the
number of iterations of the algorithm to run, which must be greater
than zero; second, the exponent (N) described in the algorithm, giving
the log to the base 2 of the size of the buffer to use, as a
floating-point number, which must be greater than 8.0; and third, the
width of the data type to use, which must be 8, 16, 32, or 64.

For example, an input file containing

    5000000 16 8

would mean to run 5 million random updates over a 64 Kbyte buffer full
of chars.

Note that the number of iterations and the size of the buffer to use
are both rounded down to the next lower multiple of 256.

Description of the algorithm
----------------------------

"Read a memory location, update it, and write it back. Next, pick
another memory location (but do it randomly), update it, and write it
back. And do this over a range of memory locations that span a range
of 1 to 2 to the N, where N is a big number, like 30, or 32, or 34."
 -- Candace S. Culhane, 6 April 2001
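
The loop below is a minimal sketch of that algorithm for the 8-bit
case. It is not the actual code from gups_serial.c; the function and
parameter names are illustrative, and random() merely stands in for
whichever of the generators described below is configured.

    /* A minimal sketch of the fetch/modify/write-back loop over a
     * buffer of 8-bit elements; "field", "nelems", and "niters" are
     * illustrative names, not the ones used in gups_serial.c.
     * random() only yields 31 bits, so it would not cover very
     * large N by itself. */
    #include <stdlib.h>
    #include "gups_types.h"        /* uint8, uint64 */

    void gups_update_sketch(uint8 *field, uint64 nelems, uint64 niters)
    {
        uint64 i;

        for (i = 0; i < niters; i++) {
            uint64 idx = (uint64)random() % nelems; /* pick a random location */
            field[idx] += 1;                        /* read, update, write back */
        }
    }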

Random number generators
------------------------

The code will either use the system random() generator, or it will use
a linear feedback shift register-based random number generator, which
has the handy property that overlapping accesses to the same array
element are minimized. The DIS Stressmark random number generator
could also be used instead of random().

Improvements to make
--------------------

The use of exp() could be avoided, for portability's sake, with an
approximation.

Some sort of autoconfiguration code could be written (or Autoconf
could be used) to automatically find out which types should be which
in gups_types.h, given the compiler flags the user wants to use.

Calculate all the addresses ahead of time (i.e., instead of
calculating an array of indices ahead of time, calculate an array of
addresses).

Parallelization improvements & strategies
-----------------------------------------

The size of the field vector will be some nonzero multiple of 256.
(This assumption has been implemented.)

The number of iterations will be some nonzero multiple of 256.
(This assumption has been implemented.)

The processor should switch vpw to 32 to load the indices, and then
switch vpw to the data width again afterwards. (This hasn't been
implemented.)

Overflow doesn't matter. We shall use field[indices[i]] ^= 1 (toggle
the lowest bit) instead of adding 1. (This is not in the current
version of the code because it was found not to have an impact on
performance.)
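
To make the index-precomputation strategy concrete, the update loop
the vectorizing versions aim for looks roughly like the sketch below.
The name gups_indexed_sketch and its parameters are invented here for
illustration; the real gups_indexserial and gups_parallel code
differs.

    /* Sketch of an index-precomputed update for 64-bit data
     * (illustrative only).  With the indices computed ahead of time,
     * the loop is a plain gather / modify / scatter that a
     * vectorizer can map onto indexed loads and stores. */
    #include "gups_types.h"        /* uint32, uint64 */

    void gups_indexed_sketch(uint64 *field, const uint32 *indices,
                             uint32 niters)
    {
        uint32 i;

        /* niters and the field size are assumed to be nonzero
         * multiples of 256, per the assumptions above. */
        for (i = 0; i < niters; i++)
            field[indices[i]] += 1;  /* or: ^= 1 to toggle the lowest bit */
    }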