Software Architecture of the Mark-II Performance Simulator

This document describes the design we anticipate for the Mark-II performance simulator.
Components.	The components of the new performance simulator are: Lexer & Parser; Control Unit; Issue Unit; Functional Units (Memory Functional Unit, Flag Functional Unit, Arithmetic Functional Unit 1 w/floating point, Arithmetic Functional Unit 2); Memory system; Translation system.
Usage of the Simulator:
The Front End, including Lexer & Parser.	The old performance simulator did not rely on traces; it had a complete processor model, which essentially forced it into a cycle-at-a-time model. The new simulator will take as input the `vsim-isa -verify_trace` traces, as well as the binaries themselves. This frees us from having to track the actual control flow of the program, which in many cases will let us simulate more than one cycle at a time (because the simulator's own control flow will be much less data-directed). The entry point to the simulator is essentially a lexical analyzer and context-free grammar parser for the `verify_trace` records, which passes trace records as data structures to the rest of the simulator. This could be split off into a separate trace preprocessor if we determine later that it would help. One useful feature of the old simulator was the ability to analyze the performance of only certain functions or regions of code (the `perf` and `phony` commands). We want to preserve this functionality, so we will need either the binary or a preprocessed disassembly of the binary (to be determined) as input.
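A minimal sketch of how region-limited analysis might look once trace records arrive as data structures. The `TraceRecord` fields, the `Region` type, and the `RegionFilter` class are hypothetical names chosen for illustration; the real `verify_trace` record layout and the way `perf` regions are specified are not described here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical trace record; the actual verify_trace fields are an assumption.
struct TraceRecord {
    std::uint64_t pc;      // program counter of the traced instruction
    std::uint32_t opcode;  // raw instruction word (assumed field)
};

// Half-open address range [begin, end) covering one function or code region.
struct Region {
    std::uint64_t begin;
    std::uint64_t end;
};

class RegionFilter {
public:
    void add(std::uint64_t begin, std::uint64_t end) {
        assert(begin < end);
        regions_.push_back({begin, end});
    }

    // With no regions registered, the whole program is analyzed.
    bool selected(const TraceRecord& rec) const {
        if (regions_.empty()) return true;
        for (const Region& r : regions_) {
            if (rec.pc >= r.begin && rec.pc < r.end) return true;
        }
        return false;
    }

private:
    std::vector<Region> regions_;
};
```

Under this sketch, the front end would call `selected()` on each record and tag the record accordingly, so that later stages can skip detailed timing for instructions outside the regions of interest.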
Queues:
The Control Unit.	It will be passed instructions by the parser. It will have a queue of 4 entries to absorb a faster production rate from the scalar core. It will also retire any instructions that modify the vector control registers. If its queue is full, it will return to the parser a stall with the number of cycles to wait. To compute the number of stall cycles, it will either work out how many cycles it needs to retire the instruction (if the instruction retires in the Control Unit) or ask the Issue Unit when its queue will have an empty slot, so that the current instruction can move on. Also, when we are not simulating the whole program, the Control Unit can simply retire the instructions whose PCs fall outside the simulated regions. The Control Unit will also attach the vector control information (vpw, vl, etc.) to each instruction as it passes it to the Issue Unit.
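The queue-full behaviour described above could be sketched roughly as follows. The `Instr` fields, the `offer()` method, and the `issue_slot_wait` argument (standing in for whatever the Issue Unit reports) are assumptions made for illustration, not the simulator's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical instruction summary as seen by the Control Unit.
struct Instr {
    bool retires_in_control;  // true if it only modifies vector control registers
    int  retire_cycles;       // cycles needed to retire it in the Control Unit
};

class ControlUnit {
public:
    static constexpr std::size_t kCapacity = 4;  // queue depth given in the text

    // Returns 0 if the instruction was queued; otherwise the number of cycles
    // the parser should stall before retrying. issue_slot_wait is the number
    // of cycles until the Issue Unit can take the head instruction.
    int offer(const Instr& in, int issue_slot_wait) {
        if (queue_.size() < kCapacity) {
            queue_.push_back(in);
            return 0;
        }
        const Instr& head = queue_.front();
        int wait = head.retires_in_control ? head.retire_cycles : issue_slot_wait;
        return wait > 0 ? wait : 1;  // a full queue always costs at least one cycle
    }

    // Removes the head once it has retired or moved on to the Issue Unit.
    void pop_head() {
        assert(!queue_.empty());
        queue_.pop_front();
    }

    std::size_t size() const { return queue_.size(); }

private:
    std::deque<Instr> queue_;
};
```

Returning a cycle count rather than a plain "full" flag is what would let the parser skip ahead several cycles at once instead of polling every cycle.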
The Issue Unit.	It will have a queue of 8 entries. The Issue Unit will find out from each of the functional units how long it has to stall before it can issue the next instruction in the queue to that functional unit. It will also return the number of stall cycles to the Control Unit if its queue is full; to compute that number, it asks the relevant functional unit for its stall cycles. It will also handle structural hazards when the functional units are all in use. Finally, it will have to decide when to issue instructions depending on how many cycles each pipeline stage takes and whether the next instruction is pipelined at all.
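One way the Issue Unit could query functional units for stall cycles is through a small common interface. Everything named below (`FunctionalUnit`, `FixedLatencyUnit`, `Unit`, `try_issue()`) is a hypothetical sketch; in particular, the fixed-latency unit is a toy stand-in for the real unit models.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>

enum class Unit { Memory = 0, Flag = 1, Arith1 = 2, Arith2 = 3 };

// Hypothetical interface each functional unit model would implement.
class FunctionalUnit {
public:
    virtual ~FunctionalUnit() = default;
    virtual int stall_cycles() const = 0;  // 0 means ready to accept
    virtual void accept() = 0;             // take one instruction
};

// Toy unit: busy for a fixed number of cycles after each accepted instruction.
class FixedLatencyUnit : public FunctionalUnit {
public:
    explicit FixedLatencyUnit(int latency) : latency_(latency) {}
    int stall_cycles() const override { return busy_; }
    void accept() override { busy_ = latency_; }
    void advance(int cycles) { busy_ = busy_ > cycles ? busy_ - cycles : 0; }

private:
    int latency_;
    int busy_ = 0;
};

class IssueUnit {
public:
    static constexpr std::size_t kCapacity = 8;  // queue depth given in the text

    explicit IssueUnit(std::array<FunctionalUnit*, 4> units) : units_(units) {}

    bool full() const { return queue_.size() == kCapacity; }

    void enqueue(Unit target) {
        assert(!full());
        queue_.push_back(target);
    }

    // Tries to issue the head instruction. Returns 0 on success, otherwise
    // the stall cycles reported by the target functional unit.
    int try_issue() {
        assert(!queue_.empty());
        FunctionalUnit* fu = units_[static_cast<std::size_t>(queue_.front())];
        int stall = fu->stall_cycles();
        if (stall > 0) return stall;
        fu->accept();
        queue_.pop_front();
        return 0;
    }

private:
    std::array<FunctionalUnit*, 4> units_;
    std::deque<Unit> queue_;
};
```

The same stall value that blocks `try_issue()` is what the Issue Unit would forward to the Control Unit when its own queue is full.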
Functional Units.
The Memory FU.	It will be simulated by stepping the first 4 stages of the pipeline cycle by cycle, so that we can detect and relay information about stalls as necessary. This will be quick and will only happen when the memory pipeline is being used. It will also control stalling of the other functional units. It will get its stall cycles for the fourth stage from the memory system, and for the third stage from the translation system. It will then pass the Issue Unit the maximum of the two as the functional unit's stall cycles.
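The combination rule in the last sentence is small enough to state directly. The function name `memory_fu_stall` is hypothetical; the two arguments stand for the values reported by the translation system (third stage) and the memory system (fourth stage).

```cpp
#include <algorithm>
#include <cassert>

// Stall reported by the Memory FU to the Issue Unit: the larger of the
// translation stall (stage 3) and the memory stall (stage 4).
int memory_fu_stall(int translation_stall, int memory_stall) {
    assert(translation_stall >= 0 && memory_stall >= 0);
    return std::max(translation_stall, memory_stall);
}
```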
The Arithmetic Units.	These will basically simulate only how many cycles are left for each instruction, and when an instruction has passed enough stages that a new instruction can be issued to the unit. Each unit will keep a queue of the instructions issued to it and how many cycles are left for each of them. Along with the number of cycles remaining per instruction, we're going to keep a measure of the number of cycles remaining before the next ALU instruction to be issued can be "chained".
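A rough sketch of the bookkeeping an arithmetic unit might do, assuming each instruction carries a total latency and a chaining delay. The names `InFlight`, `issue()`, `advance()`, and `cycles_until_chain()` are illustrative, and the latencies themselves would come from the instruction timings we have yet to obtain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct InFlight {
    int cycles_left;      // cycles until this instruction completes
    int cycles_to_chain;  // cycles until a following instruction may chain
};

class ArithmeticUnit {
public:
    void issue(int latency, int chain_delay) {
        assert(latency > 0 && chain_delay >= 0 && chain_delay <= latency);
        in_flight_.push_back({latency, chain_delay});
    }

    // Advances the unit by several cycles at once and drops finished work.
    void advance(int cycles) {
        assert(cycles >= 0);
        for (InFlight& f : in_flight_) {
            f.cycles_left -= cycles;
            f.cycles_to_chain = std::max(0, f.cycles_to_chain - cycles);
        }
        in_flight_.erase(
            std::remove_if(in_flight_.begin(), in_flight_.end(),
                           [](const InFlight& f) { return f.cycles_left <= 0; }),
            in_flight_.end());
    }

    // Cycles before the next instruction can chain off the most recent one.
    int cycles_until_chain() const {
        return in_flight_.empty() ? 0 : in_flight_.back().cycles_to_chain;
    }

    std::size_t in_flight() const { return in_flight_.size(); }

private:
    std::vector<InFlight> in_flight_;
};
```

Because `advance()` takes a cycle count, the unit never has to be stepped one cycle at a time, which is the point of moving away from the old cycle-at-a-time model.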
The Flag Unit.	This is similar to the arithmetic units: it just keeps track of the cycles left for each instruction to finish.
Memory system.	The memory system will keep a queue per bank (or sub-bank, as the case may be). It will return to the memory functional unit the stall cycles for memory accesses to the same bank. It will also have to deal with cache fills caused by scalar core cache misses, and with virtual memory macro-TLB misses that need to be loaded from memory. Stall cycles can be passed back for each memory access because we always know how many cycles the current access will take and how many cycles each access ahead of it in the queue will take. The memory functional unit can then use this to stall the Issue Unit and the other units.
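The per-bank queueing idea can be sketched as below: the stall for a new access is the sum of the remaining cycles of everything ahead of it on the same bank. The class and method names are hypothetical, and cache fills and TLB refills are left out; they would simply be further accesses placed on the same queues.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

class MemorySystem {
public:
    explicit MemorySystem(std::size_t banks) : queues_(banks) {
        assert(banks > 0);
    }

    // Queues an access that occupies `bank` for `cycles` cycles. Returns the
    // number of cycles it must wait for earlier accesses on the same bank.
    int enqueue(std::size_t bank, int cycles) {
        assert(bank < queues_.size() && cycles > 0);
        std::deque<int>& q = queues_[bank];
        int wait = std::accumulate(q.begin(), q.end(), 0);
        q.push_back(cycles);
        return wait;
    }

    // Advances every bank by `cycles` cycles, retiring completed accesses.
    void advance(int cycles) {
        assert(cycles >= 0);
        for (std::deque<int>& q : queues_) {
            int budget = cycles;
            while (budget > 0 && !q.empty()) {
                if (q.front() <= budget) {
                    budget -= q.front();
                    q.pop_front();
                } else {
                    q.front() -= budget;
                    budget = 0;
                }
            }
        }
    }

private:
    std::vector<std::deque<int>> queues_;  // remaining cycles per queued access
};
```

Accesses to different banks do not delay each other in this sketch, which matches the idea that only same-bank conflicts produce stalls.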
Virtual to Physical Translation system.	We need to look into the exact details of the translation system used in VIRAM. It seems to use a micro-TLB and a macro-TLB, with any misses in the macro-TLB going to the page table in memory for lookup.
Metrics.	Preliminary list of things to keep track of and issue statistics for: stalls (and reasons for them: queue fills, memory conflicts, TLB, data hazards); memory bandwidth used vs. theoretical maximum; % vectorization in performance-analyzed sections of code; MIPS; FLOPS. Preliminary list of things to parameterize on: number of vector lanes; subbanks in DRAM; queue size for the Issue Unit.
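As a sketch of how these lists might map onto code, the statistics could live in one small structure and the parameters in another. All names here are hypothetical, and the default parameter values are placeholders rather than VIRAM figures (apart from the Issue Unit queue size of 8 mentioned earlier).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class StallReason { QueueFull, MemoryConflict, Tlb, DataHazard, Count };

// Simulation parameters; defaults are placeholders, not measured values.
struct SimParams {
    int vector_lanes = 4;
    int dram_subbanks = 1;
    int issue_queue_size = 8;
};

struct Stats {
    std::array<std::uint64_t, static_cast<std::size_t>(StallReason::Count)>
        stall_cycles{};
    std::uint64_t total_cycles = 0;
    std::uint64_t bytes_transferred = 0;

    void add_stall(StallReason reason, std::uint64_t cycles) {
        stall_cycles[static_cast<std::size_t>(reason)] += cycles;
    }

    std::uint64_t stalls(StallReason reason) const {
        return stall_cycles[static_cast<std::size_t>(reason)];
    }

    // Fraction of the theoretical peak memory bandwidth actually used.
    double bandwidth_utilization(double peak_bytes_per_cycle) const {
        assert(peak_bytes_per_cycle > 0.0);
        if (total_cycles == 0) return 0.0;
        return static_cast<double>(bytes_transferred) /
               (peak_bytes_per_cycle * static_cast<double>(total_cycles));
    }
};
```

Keeping stall cycles indexed by reason makes it cheap for any unit to attribute a stall at the point where it is detected.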
Questionnaire Answers.
What's Done.	We have a working lexer and parser, and we have a main loop which calls stub versions of the implementations for the various functional units. Confusingly, the SGI C++ compiler (on i.cs) doesn't like it very much, but it seems to work OK with GNU C++. This will probably be worked out shortly.
Expert Advice.	We have talked with Christoforos about the multi-cycle simulation, and he warned us that simulating the memory functional unit on a higher level than cycle-by-cycle can be tricky. We are trying to work out how to precompute enough information so that little work has to be done on a cycle-by-cycle basis.
Still Needed.	We still have to get the precise timings of the various instructions from Christoforos, and we're still working on our lists of metrics to calculate. We hope to talk to the various other IRAM benchmarkers in more depth this week. If not, we'll pursue this by email.
Next Meeting.	By the next meeting we plan to have working versions of all the functional units, and we should be ready to begin collecting performance results and correlating them to results from the old performance simulator.

Maintained by brg at eecs.berkeley.edu. Last modified on Monday, 07-Oct-2002 20:07:26 PDT.