GUPS
----

This is code to evaluate an architecture with respect to the NSA
Giga-Updates per Second benchmark. It was written by Brian R. Gaeke
at UC Berkeley, and is not endorsed by NSA.

Hints for compiling
-------------------

This is a research code, and a benchmark, so it's not set up to have
sane software configuration. Hack the Makefile, header files, and code
until they work well for your system. Feel free to mail me any hints
you have and I will archive them on the Web for people to see.
Remember the AT&T Unix support policy: as-is, no advertising, no
support, no bug fixes, payment in advance.

You will need GNU Make in order to use the Makefile, or you can write
your own.

It is useful to be able to access very large pointers. On SGI, you
need the -64 flag for this. Modify gups_types.h to typedef uintN as
an N-bit type, for all N in {8, 16, 32, 64}, and the program should
get the rest straight.

You will need the exp() math call. Link with -lm (the standard math
library) for this. (Don't worry, it's only called from the startup
code.)

Versions
--------

gups_indexserial and gups_serial are both derived from the code in
gups_serial.c, which fetches, modifies, and updates one data element
after another, serially. gups_indexserial computes an array of indices
first, whereas gups_serial computes them on the fly.

gups_parallel is a version, intended for the Cray C compiler, that has
been optimized for easy vectorization. It is a data-parallel version
of the gups_serial code: if the compiler is willing, it will fetch
whole random vectors at once using indexed loads, modify them all by
adding one to each element, and then store them all back again at once
with an indexed store.

gups_skeleton is a copy of gups_parallel with the actual C code for
the gups routines removed, so that you can easily replace them with
hand-optimized assembly routines (in separate object files, that is).

annotated_asm is unoptimized assembly code for the GUPS routines for
the VIRAM-1, with comments added. optimized_asm is the same thing, but
hand-optimized to remove some unnecessary code and to account for
assumptions we made that the compiler couldn't prove.

Description of the input file
-----------------------------

The input file contains three numbers, in free-form format: first, the
number of iterations of the algorithm to run, which must be greater
than zero; second, the exponent (N) described in the algorithm, giving
the log to the base 2 of the size of the buffer to use, as a
floating-point number, which must be greater than 8.0; and third, the
width of the data type to use, which must be 8, 16, 32, or 64.

For example, an input file containing

    5000000 16 8

would mean to run 5 million random updates over a 64 Kbyte buffer full
of chars.

Note that the number of iterations and the size of the buffer to use
are both rounded down to the next lower multiple of 256.

Description of the algorithm
----------------------------

"Read a memory location, update it, and write it back. Next, pick
another memory location (but do it randomly), update it, and write it
back. And do this over a range of memory locations that span a range
of 1 to 2 to the N, where N is a big number, like 30, or 32, or 34."
 -- Candace S. Culhane, 6 April 2001
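
The loop below is a minimal sketch of that algorithm for the 8-bit
case. It is not the actual code from gups_serial.c; the function and
parameter names are illustrative, and random() merely stands in for
whichever of the generators described below is configured.

    /* A minimal sketch of the fetch/modify/write-back loop over a
     * buffer of 8-bit elements; "field", "nelems", and "niters" are
     * illustrative names, not the ones used in gups_serial.c.
     * random() only yields 31 bits, so it would not cover very
     * large N by itself. */
    #include <stdlib.h>
    #include "gups_types.h"        /* uint8, uint64 */

    void gups_update_sketch(uint8 *field, uint64 nelems, uint64 niters)
    {
        uint64 i;

        for (i = 0; i < niters; i++) {
            uint64 idx = (uint64)random() % nelems; /* pick a random location */
            field[idx] += 1;                        /* read, update, write back */
        }
    }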

Random number generators
------------------------

The code will either use the system random() generator, or it will use
a linear feedback shift register-based random number generator, which
has the handy property that overlapping accesses to the same array
element are minimized. The DIS Stressmark random number generator
could also be used instead of random().

Improvements to make
--------------------

The use of exp() could be avoided, for portability's sake, with an
approximation.

Some sort of autoconfiguration code could be written (or Autoconf
could be used) to automatically find out which types should be which
in gups_types.h, given the compiler flags the user wants to use.

Calculate all the addresses ahead of time (i.e., instead of
calculating an array of indices ahead of time, calculate an array of
addresses).

Parallelization improvements & strategies
-----------------------------------------

The size of the field vector will be some nonzero multiple of 256.
(This assumption has been implemented.)

The number of iterations will be some nonzero multiple of 256.
(This assumption has been implemented.)

The processor should switch vpw to 32 to load the indices, and then
switch vpw to the data width again afterwards. (This hasn't been
implemented.)

Overflow doesn't matter. We shall use field[indices[i]] ^= 1 (toggle
the lowest bit) instead of adding 1. (This is not in the current
version of the code because it was found not to have an impact on
performance.)
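
To make the index-precomputation strategy concrete, the update loop
the vectorizing versions aim for looks roughly like the sketch below.
The name gups_indexed_sketch and its parameters are invented here for
illustration; the real gups_indexserial and gups_parallel code
differs.

    /* Sketch of an index-precomputed update for 64-bit data
     * (illustrative only).  With the indices computed ahead of time,
     * the loop is a plain gather / modify / scatter that a
     * vectorizer can map onto indexed loads and stores. */
    #include "gups_types.h"        /* uint32, uint64 */

    void gups_indexed_sketch(uint64 *field, const uint32 *indices,
                             uint32 niters)
    {
        uint32 i;

        /* niters and the field size are assumed to be nonzero
         * multiples of 256, per the assumptions above. */
        for (i = 0; i < niters; i++)
            field[indices[i]] += 1;  /* or: ^= 1 to toggle the lowest bit */
    }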