Superscalar
In a superscalar architecture, from two to eight independent
pipelines are available for instruction issue each cycle.
For DLX, consider two pipelines, one for integer instructions
and one for FP instructions.
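| Integer Instruction | IF | ID | EX | MEM | WB | | |
| FP Instruction | IF | ID | EX | MEM | WB | | |
| Integer Instruction | | IF | ID | EX | MEM | WB | |
| FP Instruction | | IF | ID | EX | MEM | WB | |
| Integer Instruction | | | IF | ID | EX | MEM | WB |
| FP Instruction | | | IF | ID | EX | MEM | WB |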
But this increases hardware requirements:
Instruction bandwidth is doubled
- each cycle we must fetch 64 bits of instruction (two 32-bit instructions)
Hazard detection is more complex
- we must look for dependencies between the two instructions
fetched. The issue logic must decide whether it can issue
one or both of the instructions fetched.
It also increases the burden on the compiler:
- statically schedule code so that instructions fall on a
64-bit boundary with an integer and FP instruction paired wherever possible.
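The issue decision described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual DLX issue logic: the `Instr` record and the `issue_count` function are hypothetical names, and the sketch assumes that loads, stores, branches and integer ALU operations all use the integer pipeline while FP arithmetic uses the FP pipeline.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction (names are assumptions of this sketch)."""
    op: str                  # mnemonic, e.g. "ld" or "addd"
    dest: Optional[str]      # register written, or None (stores, branches)
    srcs: Tuple[str, ...]    # registers read
    is_fp: bool              # True if the instruction needs the FP pipeline


def issue_count(first: Instr, second: Instr) -> int:
    """Return how many of the two fetched instructions may issue this cycle."""
    # Structural check: one integer pipeline and one FP pipeline are available,
    # so two instructions of the same kind cannot issue together.
    if first.is_fp == second.is_fp:
        return 1
    # Data check: the second instruction must not read or overwrite the
    # register that the first instruction is about to write.
    if first.dest is not None and (
        first.dest in second.srcs or first.dest == second.dest
    ):
        return 1
    return 2


ld = Instr("ld", "f0", ("r1",), is_fp=False)
dependent_add = Instr("addd", "f4", ("f0", "f2"), is_fp=True)
independent_add = Instr("addd", "f8", ("f6", "f2"), is_fp=True)

print(issue_count(ld, dependent_add))    # only the load issues
print(issue_count(ld, independent_add))  # both issue
```

A real implementation would also have to compare against instructions already in flight in both pipelines; the sketch only covers the pair fetched together.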
Consider our vector/scalar add code:
Loop: ld   f0, 0(r1)
      addd f4, f0, f2
      sd   0(r1), f4
      subi r1, r1, #8
      bnez r1, Loop
To effectively make use of the superscalar pipelines without delays,
we need to unroll the loop 5 times:
| Integer Instruction | FP Instruction | Cycle |
| Loop: ld f0, 0(r1) | | 1 |
| ld f6, -8(r1) | | 2 |
| ld f10, -16(r1) | addd f4, f0, f2 | 3 |
| ld f14, -24(r1) | addd f8, f6, f2 | 4 |
| ld f18, -32(r1) | addd f12, f10, f2 | 5 |
| sd 0(r1), f4 | addd f16, f14, f2 | 6 |
| sd -8(r1), f8 | addd f20, f18, f2 | 7 |
| sd -16(r1), f12 | | 8 |
| sd -24(r1), f16 | | 9 |
| sd -32(r1), f20 | | 10 |
| subi r1, r1, #40 | | 11 |
| bnez r1, Loop | | 12 |
We are doing 5 iterations of the loop in 12 cycles = 2.4 cycles per iteration.
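The arithmetic can be checked with a short Python sketch. It encodes the unrolled schedule as one (integer slot, FP slot) pair per cycle, with each of the five elements 8 bytes apart and all addresses based on r1; the variable names are illustrative only.

```python
# One entry per clock cycle: (integer-pipeline slot, FP-pipeline slot).
# None marks an issue slot that goes unused in that cycle.
schedule = [
    ("ld f0, 0(r1)",     None),
    ("ld f6, -8(r1)",    None),
    ("ld f10, -16(r1)",  "addd f4, f0, f2"),
    ("ld f14, -24(r1)",  "addd f8, f6, f2"),
    ("ld f18, -32(r1)",  "addd f12, f10, f2"),
    ("sd 0(r1), f4",     "addd f16, f14, f2"),
    ("sd -8(r1), f8",    "addd f20, f18, f2"),
    ("sd -16(r1), f12",  None),
    ("sd -24(r1), f16",  None),
    ("sd -32(r1), f20",  None),
    ("subi r1, r1, #40", None),
    ("bnez r1, Loop",    None),
]

UNROLL = 5  # loop iterations covered by one pass through the schedule

cycles = len(schedule)
instructions = sum(1 for pair in schedule for slot in pair if slot is not None)
empty_fp_slots = sum(1 for _, fp in schedule if fp is None)

cycles_per_iteration = cycles / UNROLL
cpi = cycles / instructions

print(f"{cycles} cycles, {instructions} instructions")
print(f"{cycles_per_iteration} cycles per iteration, CPI = {cpi:.3f}")
print(f"{empty_fp_slots} of {cycles} FP issue slots are empty")
```

The FP pipeline is idle for most cycles, which is why the loop stays well short of the ideal 0.5 CPI even after unrolling.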
Code Scheduling may be
Static
- though we can rely on the issue logic to handle structural hazards
(i.e. we don't need extra nops).
Dynamic
- using a variation of the Tomasulo algorithm
- Use separate reservation stations/CDBs for integer and FP functional
units.
- Load/Store reservation stations become load and store queues
- Must disambiguate memory aliases:
- Load checks Store queue to avoid RAW hazards
- Store checks Load queue to avoid WAW and WAR hazards
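The two queue checks listed above can be illustrated with a minimal Python model. The class and method names are hypothetical, addresses are assumed to be already computed, and matching is by exact address only; real hardware would compare address ranges and track program order more carefully.

```python
from collections import deque


class MemoryQueues:
    """Toy model of separate load and store queues (an assumption of this sketch)."""

    def __init__(self):
        self.loads = deque()    # addresses of loads issued but not yet completed
        self.stores = deque()   # addresses of stores issued but not yet completed

    def can_start_load(self, addr):
        # A load must not bypass an earlier pending store to the same address,
        # otherwise it would read stale data (RAW through memory).
        return addr not in self.stores

    def can_start_store(self, addr):
        # The check the text assigns to stores: do not overwrite a location
        # that an earlier pending load still has to read.
        return addr not in self.loads

    def add_load(self, addr):
        self.loads.append(addr)

    def add_store(self, addr):
        self.stores.append(addr)

    def complete_load(self, addr):
        self.loads.remove(addr)

    def complete_store(self, addr):
        self.stores.remove(addr)


q = MemoryQueues()
q.add_store(0x100)
print(q.can_start_load(0x100))   # False: must wait for the pending store
print(q.can_start_load(0x108))   # True: different address, no alias
q.complete_store(0x100)
print(q.can_start_load(0x100))   # True: the store has drained
```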
Dynamic performance
| Iteration | Instruction | Issue | Executes | Write result |
| 1 | ld f0,0(r1) | 1 | 2 | 4 |
| 1 | addd f4,f0,f2 | 1 | 5 | 8 |
| 1 | sd 0(r1),f4 | 2 | 9 | |
| 1 | subi r1,r1,#8 | 3 | 4 | 5 |
| 1 | bnez r1, Loop | 4 | 5 | |
| 2 | ld f0,0(r1) | 5 | 6 | 8 |
| 2 | addd f4,f0,f2 | 5 | 9 | 12 |
| 2 | sd 0(r1),f4 | 6 | 13 | |
| 2 | subi r1,r1,#8 | 7 | 8 | 9 |
| 2 | bnez r1, Loop | 8 | 9 | |
About 4 clocks per iteration.
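The 4-clock figure is the distance between the same instruction in consecutive iterations. A small Python sketch, using the issue and execute cycles from the dynamic schedule (tuple layout chosen for illustration only), makes the subtraction explicit:

```python
# (iteration, instruction, issue cycle, execute cycle) from the dynamic schedule.
rows = [
    (1, "ld f0,0(r1)",   1, 2),
    (1, "addd f4,f0,f2", 1, 5),
    (1, "sd 0(r1),f4",   2, 9),
    (1, "subi r1,r1,#8", 3, 4),
    (1, "bnez r1,Loop",  4, 5),
    (2, "ld f0,0(r1)",   5, 6),
    (2, "addd f4,f0,f2", 5, 9),
    (2, "sd 0(r1),f4",   6, 13),
    (2, "subi r1,r1,#8", 7, 8),
    (2, "bnez r1,Loop",  8, 9),
]

first = [r for r in rows if r[0] == 1]
second = [r for r in rows if r[0] == 2]

# Gap between matching instructions of iteration 1 and iteration 2.
issue_gaps = [b[2] - a[2] for a, b in zip(first, second)]
execute_gaps = [b[3] - a[3] for a, b in zip(first, second)]

print(issue_gaps)
print(execute_gaps)
```

Every instruction repeats 4 cycles after its counterpart in the previous iteration, compared with 2.4 cycles per iteration for the statically scheduled, unrolled loop.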
Advantages/Disadvantages
Existing code will still run on a superscalar implementation
May be scheduled statically or dynamically (Tomasulo)
Needs additional register file ports
More complex hazard detection (2 opcodes, 6 register specifiers)
Load and branch delay slots now span 3 instructions
To achieve 0.5 CPI, 50% of instructions must be FP with no hazards
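The last point follows from a simple bound, sketched here in Python under idealized assumptions: one integer pipeline, one FP pipeline, no hazards, and perfect pairing of instructions. The function name `ideal_cpi` is illustrative.

```python
def ideal_cpi(n_int, n_fp):
    """Best-case CPI for a 2-issue machine with one integer and one FP pipeline.

    Each pipeline accepts at most one instruction per cycle, so the cycle
    count is bounded below by the busier pipeline's instruction count.
    """
    cycles = max(n_int, n_fp)
    return cycles / (n_int + n_fp)


print(ideal_cpi(50, 50))  # even mix: both pipelines always busy
print(ideal_cpi(75, 25))  # 25% FP: the integer pipeline is the bottleneck
print(ideal_cpi(12, 5))   # mix of the 5x unrolled loop: 12 integer, 5 FP
```

Only the even mix reaches 0.5; any other ratio leaves one pipeline idle for part of the time.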