Superscalar

A superscalar architecture provides from two to eight independent pipelines, so that several instructions can issue in each cycle.

For DLX, consider two pipelines, one for integer instructions and one for FP instructions.
[Pipeline diagram: instructions are fetched and issued in pairs, one integer instruction and one FP instruction per cycle.]

But this increases hardware requirements:

Instruction bandwidth is doubled
    Each cycle we must fetch 64 bits of instruction (two 32-bit instructions).
Hazard detection is more complex
    We must look for dependencies between the two instructions fetched. The issue logic must decide whether it can issue one or both of them.
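
The issue decision described above can be sketched in Python. This is only an illustration under stated assumptions, not the actual DLX issue logic: the `Instr` record, its fields, and the functions `can_dual_issue` and `issue_count` are hypothetical names, and the model assumes the first slot of each fetched pair holds the integer instruction and the second slot the FP instruction.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction (names are assumptions of this sketch)."""
    unit: str                      # "int" or "fp": which pipeline executes it
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """Return True if both fetched instructions may issue in the same cycle.

    Assumed rules: the pair must be (integer, FP) so that each pipeline
    receives one instruction, and the second instruction must neither read
    nor overwrite the register written by the first.
    """
    if (first.unit, second.unit) != ("int", "fp"):
        return False               # wrong mix for the two pipelines
    if first.dest is not None:
        if first.dest in second.srcs:
            return False           # RAW dependence inside the pair
        if first.dest == second.dest:
            return False           # WAW dependence inside the pair
    return True


def issue_count(first: Instr, second: Instr) -> int:
    """Number of instructions issued this cycle: both, or only the first."""
    return 2 if can_dual_issue(first, second) else 1


# FP loads run in the integer pipeline, as in the loop used in these notes.
ld_f0 = Instr("int", "f0", ("r1",))
ld_f10 = Instr("int", "f10", ("r1",))
addd_f4 = Instr("fp", "f4", ("f0", "f2"))

print(issue_count(ld_f10, addd_f4))  # independent pair
print(issue_count(ld_f0, addd_f4))   # addd needs the f0 being loaded
```

A real implementation must also check each instruction against instructions already in the pipelines; this sketch covers only the check within the fetched pair.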

It also increases the burden on the compiler:

The compiler must statically schedule code so that each instruction pair falls on a 64-bit boundary, with an integer instruction paired with an FP instruction wherever possible.

Consider our vector/scalar add code:

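     Loop:  ld     f0, 0(r1)      ; load a vector element
            addd   f4, f0, f2     ; add the scalar held in f2
            sd     0(r1), f4      ; store the result back
            subi   r1, r1, #8     ; step to the next element (8 bytes)
            bnez   r1, Loop       ; repeat until r1 reaches 0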
To effectively make use of the superscalar pipelines without delays, we need to unroll the loop 5 times:

            Integer Instruction      FP Instruction         Cycle
     Loop:  ld     f0, 0(r1)                                    1
            ld     f6, -8(r1)                                   2
            ld     f10, -16(r1)      addd   f4, f0, f2          3
            ld     f14, -24(r1)      addd   f8, f6, f2          4
            ld     f18, -32(r1)      addd   f12, f10, f2        5
            sd     0(r1), f4         addd   f16, f14, f2        6
            sd     -8(r1), f8        addd   f20, f18, f2        7
            sd     -16(r1), f12                                 8
            sd     -24(r1), f16                                 9
            sd     -32(r1), f20                                10
            subi   r1, r1, #40                                 11
            bnez   r1, Loop                                    12

This performs 5 iterations of the loop in 12 cycles, or 2.4 cycles per iteration.
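
The arithmetic can be checked with a short Python sketch. It records only which opcode occupies each pipeline in each cycle of the schedule above. The variable names are assumptions of this sketch, and the instructions-per-cycle figure is derived here rather than stated in the notes.

```python
# One entry per cycle: (integer-pipeline opcode, FP-pipeline opcode or None).
schedule = [
    ("ld", None),
    ("ld", None),
    ("ld", "addd"),
    ("ld", "addd"),
    ("ld", "addd"),
    ("sd", "addd"),
    ("sd", "addd"),
    ("sd", None),
    ("sd", None),
    ("sd", None),
    ("subi", None),
    ("bnez", None),
]

UNROLL = 5  # loop iterations covered by one pass through the schedule

cycles = len(schedule)
instructions = sum(1 for pair in schedule for op in pair if op is not None)
fp_slots_used = sum(1 for _, fp_op in schedule if fp_op is not None)

cycles_per_iteration = cycles / UNROLL
instructions_per_cycle = instructions / cycles

print(f"{cycles} cycles, {instructions} instructions")
print(f"{cycles_per_iteration:.1f} cycles per iteration")
print(f"{instructions_per_cycle:.2f} instructions per cycle")
print(f"FP pipeline busy in {fp_slots_used} of {cycles} cycles")
```

The FP pipeline is idle in most cycles, which is why this loop stays well short of two instructions per cycle.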

Code scheduling may be

Static
    Done by the compiler, though we can rely on the issue logic to handle structural hazards (i.e. we don't need extra nops).
Dynamic
    Using a variation of the Tomasulo algorithm.

Dynamic performance

     Iteration  Instruction        Issue  Executes  Write result
         1      ld   f0,0(r1)        1       2           4
         1      addd f4,f0,f2        1       5           8
         1      sd   0(r1),f4        2       9
         1      subi r1,r1,#8        3       4           5
         1      bnez r1, Loop        4       5
         2      ld   f0,0(r1)        5       6           8
         2      addd f4,f0,f2        5       9          12
         2      sd   0(r1),f4        6      13
         2      subi r1,r1,#8        7       8           9
         2      bnez r1, Loop        8       9

About 4 clocks per iteration.
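
The figure of 4 clocks can be read off the table by comparing the two iterations. A minimal Python sketch, with the rows copied from the table and helper names (`first_issue`, `last_execute`) that are assumptions of this sketch:

```python
# Rows from the table: (iteration, instruction, issue, execute, write result).
# The write-result entry is None where the table leaves it blank.
rows = [
    (1, "ld f0,0(r1)",   1, 2, 4),
    (1, "addd f4,f0,f2", 1, 5, 8),
    (1, "sd 0(r1),f4",   2, 9, None),
    (1, "subi r1,r1,#8", 3, 4, 5),
    (1, "bnez r1,Loop",  4, 5, None),
    (2, "ld f0,0(r1)",   5, 6, 8),
    (2, "addd f4,f0,f2", 5, 9, 12),
    (2, "sd 0(r1),f4",   6, 13, None),
    (2, "subi r1,r1,#8", 7, 8, 9),
    (2, "bnez r1,Loop",  8, 9, None),
]


def first_issue(iteration: int) -> int:
    """Clock in which the first instruction of an iteration issues."""
    return min(row[2] for row in rows if row[0] == iteration)


def last_execute(iteration: int) -> int:
    """Clock in which the last instruction of an iteration executes."""
    return max(row[3] for row in rows if row[0] == iteration)


issue_gap = first_issue(2) - first_issue(1)
execute_gap = last_execute(2) - last_execute(1)

print("clocks between iteration starts:", issue_gap)
print("clocks between final executes:", execute_gap)
```

Both measures give the same spacing between iterations, so the loop settles at that rate.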

Advantages/Disadvantages

Existing code will still run on a superscalar implementation
May be scheduled statically or dynamically (Tomasulo)
Need additional register file ports
More complex hazard detection (2 opcodes, 6 register specifiers)
Load and branch delays now span 3 instruction slots
To achieve 0.5 CPI, 50% of instructions must be FP with no hazards
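
The 0.5 CPI condition can be illustrated with a small Python sketch. It rests on a simplifying assumption that is not in the notes: with one integer and one FP pipeline, perfect pairing and no hazards, the busier pipeline sets the cycle count, so the ideal CPI is the larger of the FP and integer fractions. The function name `ideal_cpi` is hypothetical.

```python
def ideal_cpi(fp_fraction: float) -> float:
    """Best-case CPI for a two-pipeline (integer + FP) machine.

    Assumes each pipeline accepts one instruction per cycle, instructions
    can always be paired, and there are no hazards or stalls. The pipeline
    with more work then determines the total number of cycles.
    """
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must lie between 0 and 1")
    int_fraction = 1.0 - fp_fraction
    return max(fp_fraction, int_fraction)


for fraction in (0.0, 0.25, 0.5, 0.75):
    print(f"FP fraction {fraction:.2f}: ideal CPI {ideal_cpi(fraction):.2f}")
```

Only an even split between integer and FP instructions reaches the minimum; any other mix leaves one pipeline partly idle.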
