PowerPC 620: Case Study

The PPC620 is an implementation of the 64-bit version of the PowerPC architecture.

Other recent processors with similar features include the MIPS R10000 and HP PA-8000. The DEC Alpha 21164 and UltraSPARC are also multi-issue designs, though not quite as aggressive as the PPC620.

Features

Four-way superscalar architecture
can fetch, issue, and complete up to 4 instructions per cycle.
Six independent functional units:
Two Simple Integer units, XSU0 and XSU1.
handle simple integer operations (add, subtract, logical) with a single-cycle latency.
One Complex Integer unit, MCFXU.
handles integer multiply and divide with latency of 3 to 20 cycles. Multiply is fully pipelined; divide is unpipelined.
One Load/Store unit, LSU.
handles all load and store instructions and includes its own effective address (EA) adder. This unit includes both the load and store buffers and disambiguates memory references internally. The store buffer is really two buffers: one for stores waiting for EA operands, and one for stores waiting for commit. The load buffer allows one outstanding cache miss to be processed while other loads and stores proceed. Subsequent cache misses are returned to the reservation station, allowing up to 3 misses to occur before the unit completely stalls. The cache is dual ported (two banks) to allow up to two operations to proceed in parallel.
One Floating Point unit, FPU.
handles all FP operations with a latency of 2 cycles for multiply, add and multiply-add (3-stage pipeline), 31 for divide (unpipelined).
One Branch Prediction unit, BPU.
predicts and completes branch instructions. Includes the condition code register used in the PPC architecture.
Supports hardware speculative execution
similar to what we have seen except that uncommitted results are stored in the store buffer or a set of renaming registers instead of the reorder buffer.
Five-stage pipeline
Fetch
Fetches 4 instructions per cycle and updates the PC. Includes a 256-entry, two-way set-associative Branch Target Buffer and a 2048-entry Branch Prediction Buffer, both updated by the BPU. Also includes a return address stack.
Decode
Decodes 4 instructions and prepares them for issue.
Issue
Issues up to 4 instructions to reservation stations, allocating a rename register for the result and a reorder buffer entry.
Execute
When its operands are available, the instruction is sent to the functional unit for computation. When results are available, they are broadcast on one of the result busses (CDB) and so written into any reservation station awaiting them and into the rename buffer. If the instruction is a mispredicted branch, the fetch and completion units are informed. In any case, the completion unit is informed.
Commit
When all previous instructions have completed, up to 4 instructions can complete in one cycle by updating the register file from the rename buffer and freeing their rename and reorder buffer entries. For store instructions, the LSU is informed so that it can send the store results to the cache.
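
The commit rule can be made concrete with a small software model. The sketch below is only an illustration, not a description of the actual PPC620 hardware: the names RobEntry, commit_cycle, rename_regs, free_renames and store_queue are hypothetical, and the reorder buffer is assumed to be a simple in-order queue. It shows the three properties stated above: instructions retire in program order, at most 4 retire per cycle, and retiring copies the result from the rename register to the register file and frees both entries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

COMMIT_WIDTH = 4  # from the text: up to 4 instructions complete per cycle


@dataclass
class RobEntry:
    """One reorder buffer entry (hypothetical layout, for illustration only)."""
    dest: Optional[str] = None    # architectural register, None if no result
    rename: Optional[int] = None  # rename register holding the result
    done: bool = False            # set once the result has been broadcast
    is_store: bool = False


def commit_cycle(rob, rename_regs, reg_file, free_renames, store_queue):
    """Retire finished instructions from the head of rob, in program order.

    Stops at the first unfinished instruction, so a younger instruction
    never commits before an older one. Returns the number retired.
    """
    committed = 0
    while rob and committed < COMMIT_WIDTH and rob[0].done:
        entry = rob.popleft()              # frees the reorder buffer entry
        if entry.is_store:
            store_queue.append(entry)      # stand-in for informing the LSU
        elif entry.dest is not None:
            reg_file[entry.dest] = rename_regs[entry.rename]
            free_renames.append(entry.rename)  # rename register is reusable
        committed += 1
    return committed
```

In this model, a finished instruction that sits behind an unfinished one stays in the queue. Its result remains visible only in the rename register until the older instruction completes, which is what lets a mispredicted branch discard younger results without touching the register file.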

Performance

