

















NOW Handout Page 2



## Case Study: MIPS R4000

Eight pipeline stages: IF, IS, RF, EX, DF, DS, TC, WB. Successive instructions overlap as follows (columns are clock cycles):

| Instruction | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
|-------------|----|----|----|----|----|----|----|----|
| i           | IF | IS | RF | EX | DF | DS | TC | WB |
| i+1         |    | IF | IS | RF | EX | DF | DS | TC |
| i+2         |    |    | IF | IS | RF | EX | DF | DS |
| i+3         |    |    |    | IF | IS | RF | EX | DF |
| i+4         |    |    |    |    | IF | IS | RF | EX |
| i+5         |    |    |    |    |    | IF | IS | RF |
| i+6         |    |    |    |    |    |    | IF | IS |
| i+7         |    |    |    |    |    |    |    | IF |

- TWO-cycle load latency
- THREE-cycle branch latency (conditions evaluated during EX phase)
  - Delay slot plus two stalls
  - Branch likely cancels delay slot

2/1/2005, CS252 SP05, Lec 5 OOC
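Both latencies can be read off the stage order. The following is a minimal Python sketch, assuming load data becomes available at the end of DS and the branch outcome at the end of EX; the `delay` helper and the choice of "ready" stages are illustrative assumptions, not taken from the slide.

```python
# Stage order of the R4000 pipeline, as listed on the slide.
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def delay(ready_stage, needed_stage):
    """Extra cycles an immediately following instruction must wait.

    ready_stage:  stage at whose end the producer's result exists
    needed_stage: stage at whose start the consumer needs it
    """
    return STAGES.index(ready_stage) - STAGES.index(needed_stage)

# A dependent instruction needs load data in EX; assumed ready after DS.
load_latency = delay("DS", "EX")

# The branch target fetch (IF) needs the outcome, assumed known after EX.
branch_latency = delay("EX", "IF")

print(load_latency, branch_latency)
```

Under these assumptions the sketch reproduces the two-cycle load latency and three-cycle branch latency stated above.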



## MIPS FP Pipe Stages

| FP Instr       | 1 | 2   | 3                    | 4               | 5 | 6   | 7   | 8                |
|----------------|---|-----|----------------------|-----------------|---|-----|-----|------------------|
| Add, Subtract  | U | S+A | A+R                  | R+S             |   |     |     |                  |
| Multiply       | U | E+M | M                    | M               | M | N   | N+A | R                |
| Divide         | U | A   | R                    | D<sup>28</sup>  | … | D+A | D+R | D+A, D+R, A, R   |
| Square root    | U | E   | (A+R)<sup>108</sup>  | …               | … | A   | R   |                  |
| Negate         | U | S   |                      |                 |   |     |     |                  |
| Absolute value | U | S   |                      |                 |   |     |     |                  |
| FP compare     | U | A   | R                    |                 |   |     |     |                  |

Stages:

| Stage | Function                   |
|-------|----------------------------|
| M     | First stage of multiplier  |
| N     | Second stage of multiplier |
| R     | Rounding stage             |
| S     | Operand shift stage        |
| U     | Unpack FP numbers          |
| A     | Mantissa ADD stage         |
| D     | Divide pipeline stage      |
| E     | Exception test stage       |
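Because several operations share the same stage hardware, two FP instructions can collide even when they are independent. The sketch below checks a multiply followed by an add for such structural conflicts. It assumes each stage letter names a single, non-replicated unit; the function names are hypothetical.

```python
# Stage sequences copied from the table above, one entry per clock.
MULTIPLY = ["U", "E+M", "M", "M", "M", "N", "N+A", "R"]
ADD = ["U", "S+A", "A+R", "R+S"]

def units(stage):
    """Split a combined stage such as 'S+A' into the units it occupies."""
    return set(stage.split("+"))

def conflicts(first, second, gap):
    """Clocks (relative to the first op's issue) where both ops need the
    same unit, when the second op issues `gap` clocks after the first."""
    clashes = []
    for j, stage in enumerate(second):
        clock = gap + j
        if clock < len(first) and units(first[clock]) & units(stage):
            clashes.append(clock)
    return clashes

# Issue distances (1..7) at which an add would collide with a multiply.
stalling_gaps = [g for g in range(1, 8) if conflicts(MULTIPLY, ADD, g)]
print(stalling_gaps)
```

In this model an add issued 4 or 5 clocks after a multiply contends for the A and R stages; at other distances the two do not overlap.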







| Label | Instruction | Operands | Comment                    |
|-------|-------------|----------|----------------------------|
| Loop: | LD          | F0,0(R1) | ;F0=vector element         |
|       | ADDD        | F4,F0,F2 | ;add scalar in F2          |
|       | SD          | 0(R1),F4 | ;store result              |
|       | SUBI        | R1,R1,8  | ;decrement pointer 8B (DW) |
|       | BNEZ        | R1,Loop  | ;branch R1!=zero           |
|       | NOP         |          | ;delayed branch slot       |

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |
| Integer op                   | Integer op               | 0                       |
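The latency table can be read as a lookup: the number of stall cycles between a producer and a dependent consumer, reduced by however many independent instructions are placed between them. A small sketch, with the dictionary keys and the `stalls_needed` helper as illustrative names:

```python
# Latencies from the table above, keyed by (producer kind, consumer kind).
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}

def stalls_needed(producer, consumer, independent_between=0):
    """Stall cycles left after filling the gap with independent instructions."""
    return max(0, LATENCY[(producer, consumer)] - independent_between)

print(stalls_needed("load", "fp_alu"))       # LD followed directly by ADDD
print(stalls_needed("fp_alu", "store"))      # ADDD followed directly by SD
print(stalls_needed("fp_alu", "store", 2))   # two useful instructions in between
```

This is why reordering helps: every independent instruction moved into the gap removes one stall.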

## FP Loop Showing Stalls

| Clock | Label | Instruction | Operands | Comment                    |
|-------|-------|-------------|----------|----------------------------|
| 1     | Loop: | LD          | F0,0(R1) | ;F0=vector element         |
| 2     |       | stall       |          |                            |
| 3     |       | ADDD        | F4,F0,F2 | ;add scalar in F2          |
| 4     |       | stall       |          |                            |
| 5     |       | stall       |          |                            |
| 6     |       | SD          | 0(R1),F4 | ;store result              |
| 7     |       | SUBI        | R1,R1,8  | ;decrement pointer 8B (DW) |
| 8     |       | BNEZ        | R1,Loop  | ;branch R1!=zero           |
| 9     |       | stall       |          | ;delayed branch slot       |

- 9 clocks: Rewrite code to minimize stalls?
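The 9-clock count can be reproduced with a small in-order, single-issue model. This is a sketch under stated assumptions: producer/consumer pairs not listed in the latency table are treated as latency 0, and the unfilled delay slot is modelled as a NOP that occupies one clock. The tuple encoding and `issue_cycles` are hypothetical.

```python
# (producer kind, consumer kind) -> latency in clocks; missing pairs default to 0.
LATENCY = {
    ("load", "fp_alu"): 1,
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "store"): 0,
}

def issue_cycles(program):
    """program: list of (kind, dest, sources). Returns each issue clock."""
    cycles = []
    producer = {}  # register -> (kind of producing instruction, its issue clock)
    for kind, dest, sources in program:
        earliest = cycles[-1] + 1 if cycles else 1   # in-order, one per clock
        for reg in sources:
            if reg in producer:
                p_kind, p_clock = producer[reg]
                earliest = max(earliest,
                               p_clock + 1 + LATENCY.get((p_kind, kind), 0))
        cycles.append(earliest)
        if dest:
            producer[dest] = (kind, earliest)
    return cycles

loop = [
    ("load",   "F0", ["R1"]),        # LD   F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("store",  None, ["R1", "F4"]),  # SD   0(R1),F4
    ("int",    "R1", ["R1"]),        # SUBI R1,R1,8
    ("branch", None, ["R1"]),        # BNEZ R1,Loop
    ("nop",    None, []),            # unfilled delay slot
]

clocks = issue_cycles(loop)
print(clocks, "total:", clocks[-1])
```

The gaps in the issue clocks correspond to the stall rows in the table above.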

## Revised FP Loop Minimizing Stalls

| Clock | Label | Instruction | Operands | Comment                         |
|-------|-------|-------------|----------|---------------------------------|
| 1     | Loop: | LD          | F0,0(R1) |                                 |
| 2     |       | stall       |          |                                 |
| 3     |       | ADDD        | F4,F0,F2 |                                 |
| 4     |       | SUBI        | R1,R1,8  |                                 |
| 5     |       | BNEZ        | R1,Loop  | ;delayed branch                 |
| 6     |       | SD          | 8(R1),F4 | ;altered when moved past SUBI   |

- Swap BNEZ and SD by changing the address of SD
- 6 clocks: unroll loop 4 times to make code faster?
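Moving the store below SUBI is only correct because its offset is adjusted by the amount R1 was decremented. A short sketch of that address arithmetic, using a hypothetical pointer value and helper names:

```python
def effective_address(offset, r1):
    """Address computed by a displacement access such as SD offset(R1),Fx."""
    return r1 + offset

def adjusted_offset(offset, decrement):
    """Offset to use once the access is moved below SUBI R1,R1,decrement."""
    return offset + decrement

r1_before = 64             # hypothetical pointer value, for illustration only
r1_after = r1_before - 8   # effect of SUBI R1,R1,8

original = effective_address(0, r1_before)                  # SD 0(R1),F4 above SUBI
moved = effective_address(adjusted_offset(0, 8), r1_after)  # SD 8(R1),F4 below SUBI

print(original, moved)
```

Both forms address the same double word, so the store can fill the branch delay slot.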

## Unroll Loop Four Times (straightforward way)

| Line | Label | Instruction | Operands    | Comment            |
|------|-------|-------------|-------------|--------------------|
| 1    | Loop: | LD          | F0,0(R1)    | ;1 cycle stall     |
| 2    |       | ADDD        | F4,F0,F2    | ;2 cycles stall    |
| 3    |       | SD          | 0(R1),F4    | ;drop SUBI & BNEZ  |
| 4    |       | LD          | F6,-8(R1)   |                    |
| 5    |       | ADDD        | F8,F6,F2    |                    |
| 6    |       | SD          | -8(R1),F8   | ;drop SUBI & BNEZ  |
| 7    |       | LD          | F10,-16(R1) |                    |
| 8    |       | ADDD        | F12,F10,F2  |                    |
| 9    |       | SD          | -16(R1),F12 | ;drop SUBI & BNEZ  |
| 10   |       | LD          | F14,-24(R1) |                    |
| 11   |       | ADDD        | F16,F14,F2  |                    |
| 12   |       | SD          | -24(R1),F16 |                    |
| 13   |       | SUBI        | R1,R1,#32   | ;alter to 4*8      |
| 14   |       | BNEZ        | R1,LOOP     |                    |
| 15   |       | NOP         |             |                    |

- Rewrite loop to minimize stalls?
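The unrolled body follows a regular pattern: each copy gets its own register pair and an offset 8 bytes lower, and the pointer update is scaled by the unroll factor. The sketch below generates that listing and counts clocks if the code is left unscheduled, assuming one stall after each LD and two after each ADDD as annotated above. The function names and register-numbering rule are illustrative.

```python
def unroll(factor, element_size=8):
    """Emit the straightforward unrolled loop body as text lines."""
    lines = []
    for i in range(factor):
        src = "F0" if i == 0 else f"F{4 * i + 2}"   # F2 is reserved for the scalar
        dst = f"F{4 * i + 4}"
        off = -element_size * i
        lines += [f"LD   {src},{off}(R1)",
                  f"ADDD {dst},{src},F2",
                  f"SD   {off}(R1),{dst}"]
    lines += [f"SUBI R1,R1,#{factor * element_size}", "BNEZ R1,LOOP", "NOP"]
    return lines

def unscheduled_cycles(factor, load_stall=1, alu_store_stall=2):
    """Instructions plus the stalls each LD/ADDD/SD triple still incurs."""
    instructions = 3 * factor + 3
    return instructions + factor * (load_stall + alu_store_stall)

for line in unroll(4):
    print(line)
print(unscheduled_cycles(4), "clocks,", unscheduled_cycles(4) / 4, "per iteration")
```

Unrolling alone only removes loop overhead; the per-copy stalls remain until the instructions are also reordered.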

## Unrolled Loop That Minimizes Stalls

| Clock | Label | Instruction | Operands    | Comment                      |
|-------|-------|-------------|-------------|------------------------------|
| 1     | Loop: | LD          | F0,0(R1)    |                              |
| 2     |       | LD          | F6,-8(R1)   |                              |
| 3     |       | LD          | F10,-16(R1) |                              |
| 4     |       | LD          | F14,-24(R1) |                              |
| 5     |       | ADDD        | F4,F0,F2    |                              |
| 6     |       | ADDD        | F8,F6,F2    |                              |
| 7     |       | ADDD        | F12,F10,F2  |                              |
| 8     |       | ADDD        | F16,F14,F2  |                              |
| 9     |       | SD          | 0(R1),F4    |                              |
| 10    |       | SD          | -8(R1),F8   |                              |
| 11    |       | SD          | -16(R1),F12 |                              |
| 12    |       | SUBI        | R1,R1,#32   |                              |
| 13    |       | BNEZ        | R1,LOOP     |                              |
| 14    |       | SD          | 8(R1),F16   | ;address altered: 8-32 = -24 |

- What assumptions were made when the code was moved?
  - OK to move store past SUBI even though SUBI changes the register
  - OK to move loads before stores: get right data?
  - When is it safe for the compiler to make such changes?
- 14 clock cycles, or 3.5 per iteration
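One way to probe the "get right data?" question is to execute both versions on the same memory image and compare results. The following toy interpreter is a sketch, not a real DLX simulator: it supports only the five instructions used here, models a single branch delay slot, and assumes R1 starts as a multiple of 32. Agreement on one input does not prove the transformation safe in general.

```python
def step(instr, regs, mem):
    """Execute one non-branch instruction of the toy ISA."""
    op, a, b, c = instr
    if op == "LD":        # LD a, b(c)
        regs[a] = mem[regs[c] + b]
    elif op == "ADDD":    # ADDD a, b, c
        regs[a] = regs[b] + regs[c]
    elif op == "SD":      # SD a(b), c
        mem[regs[b] + a] = regs[c]
    elif op == "SUBI":    # SUBI a, b, #c
        regs[a] = regs[b] - c
    # NOP falls through and does nothing

def run(program, memory, r1, scalar):
    """Run until control falls off the end; BNEZ jumps back to index 0."""
    regs = {"R1": r1, "F2": scalar}
    mem = dict(memory)
    pc = 0
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "BNEZ":
            taken = regs[instr[1]] != 0          # condition read before the slot
            step(program[pc + 1], regs, mem)     # delay slot always executes
            pc = 0 if taken else pc + 2
        else:
            step(instr, regs, mem)
            pc += 1
    return mem

original = [("LD", "F0", 0, "R1"), ("ADDD", "F4", "F0", "F2"),
            ("SD", 0, "R1", "F4"), ("SUBI", "R1", "R1", 8),
            ("BNEZ", "R1", None, None), ("NOP", None, None, None)]

scheduled = [("LD", "F0", 0, "R1"), ("LD", "F6", -8, "R1"),
             ("LD", "F10", -16, "R1"), ("LD", "F14", -24, "R1"),
             ("ADDD", "F4", "F0", "F2"), ("ADDD", "F8", "F6", "F2"),
             ("ADDD", "F12", "F10", "F2"), ("ADDD", "F16", "F14", "F2"),
             ("SD", 0, "R1", "F4"), ("SD", -8, "R1", "F8"),
             ("SD", -16, "R1", "F12"), ("SUBI", "R1", "R1", 32),
             ("BNEZ", "R1", None, None), ("SD", 8, "R1", "F16")]

memory = {addr: addr * 10 for addr in range(8, 72, 8)}   # 8 elements at 8..64
result_a = run(original, memory, 64, 3)
result_b = run(scheduled, memory, 64, 3)
print(result_a == result_b)
```

Here the reordering is safe because every load and store in one unrolled pass touches a distinct address; a compiler must establish that kind of independence before moving loads above stores.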





## Loop Unrolling in Superscalar

| Label | Integer instruction | FP instruction  | Clock cycle |
|-------|---------------------|-----------------|-------------|
| Loop: | LD F0,0(R1)         |                 | 1           |
|       | LD F6,-8(R1)        |                 | 2           |
|       | LD F10,-16(R1)      | ADDD F4,F0,F2   | 3           |
|       | LD F14,-24(R1)      | ADDD F8,F6,F2   | 4           |
|       | LD F18,-32(R1)      | ADDD F12,F10,F2 | 5           |
|       | SD 0(R1),F4         | ADDD F16,F14,F2 | 6           |
|       | SD -8(R1),F8        | ADDD F20,F18,F2 | 7           |
|       | SD -16(R1),F12      |                 | 8           |
|       | SD -24(R1),F16      |                 | 9           |
|       | SUBI R1,R1,#40      |                 | 10          |
|       | BNEZ R1,LOOP        |                 | 11          |
|       | SD 8(R1),F20        |                 | 12          |

- Unrolled 5 times to avoid delays (+1 due to SS)
- 12 clocks, or 2.4 clocks per iteration (1.5X)
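The dual-issue schedule can be checked against the latency table (load to FP ALU op: 1, FP ALU op to store: 2), and the per-iteration figures recomputed. A sketch with the issue clocks transcribed from the table above; the dictionary layout is an illustrative choice.

```python
LOAD_TO_ALU = 1    # latency, load double -> FP ALU op
ALU_TO_STORE = 2   # latency, FP ALU op -> store double

load_clock = {"F0": 1, "F6": 2, "F10": 3, "F14": 4, "F18": 5}
# destination -> (source register, issue clock of the ADDD)
add_clock = {"F4": ("F0", 3), "F8": ("F6", 4), "F12": ("F10", 5),
             "F16": ("F14", 6), "F20": ("F18", 7)}
store_clock = {"F4": 6, "F8": 7, "F12": 8, "F16": 9, "F20": 12}

# Every ADDD must issue late enough after the load that feeds it.
adds_ok = all(clock >= load_clock[src] + 1 + LOAD_TO_ALU
              for src, clock in add_clock.values())

# Every SD must issue late enough after the ADDD that feeds it.
stores_ok = all(store_clock[dst] >= clock + 1 + ALU_TO_STORE
                for dst, (_, clock) in add_clock.items())

clocks_per_iteration = 12 / 5
speedup = 3.5 / clocks_per_iteration   # versus the scheduled scalar loop

print(adds_ok, stores_ok, clocks_per_iteration, round(speedup, 1))
```

The schedule leaves no slack on the first dependences of each chain, which is why a fifth iteration is needed to keep both issue slots busy.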



| Memory reference 1 | Memory reference 2 | FP operation 1  | FP op. 2        | Int. op/branch | Clock |
|--------------------|--------------------|-----------------|-----------------|----------------|-------|
| LD F0,0(R1)        | LD F6,-8(R1)       |                 |                 |                | 1     |
| LD F10,-16(R1)     | LD F14,-24(R1)     |                 |                 |                | 2     |
| LD F18,-32(R1)     | LD F22,-40(R1)     | ADDD F4,F0,F2   | ADDD F8,F6,F2   |                | 3     |
| LD F26,-48(R1)     |                    | ADDD F12,F10,F2 | ADDD F16,F14,F2 |                | 4     |
|                    |                    | ADDD F20,F18,F2 | ADDD F24,F22,F2 |                | 5     |
| SD 0(R1),F4        | SD -8(R1),F8       | ADDD F28,F26,F2 |                 |                | 6     |
| SD -16(R1),F12     | SD -24(R1),F16     |                 |                 |                | 7     |
| SD -32(R1),F20     | SD -40(R1),F24     |                 |                 | SUBI R1,R1,#56 | 8     |
| SD 8(R1),F28       |                    |                 |                 | BNEZ R1,LOOP   | 9     |

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
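The figures for the five-slot schedule follow from counting filled slots per clock. A small sketch, with the per-clock counts transcribed from the table above and the 2.4 clocks per iteration of the superscalar version used as the baseline:

```python
SLOTS_PER_CLOCK = 5   # 2 memory references, 2 FP operations, 1 integer op/branch

# Number of operations issued in clocks 1..9 of the schedule above.
ops_per_clock = [2, 2, 4, 3, 2, 3, 2, 3, 2]

total_ops = sum(ops_per_clock)
clocks = len(ops_per_clock)
iterations = 7

clocks_per_iteration = round(clocks / iterations, 1)
ops_per_issue = total_ops / clocks
utilization = total_ops / (clocks * SLOTS_PER_CLOCK)
speedup = round(2.4 / clocks_per_iteration, 1)   # using the rounded 1.3

print(total_ops, clocks_per_iteration, round(ops_per_issue, 1),
      round(utilization, 2), speedup)
```

Roughly half of the issue slots stay empty even after unrolling seven times, which shows how much independent work a wide machine needs to stay busy.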


## Summary

- Increasingly powerful (and complex) dynamic mechanism for detecting and resolving hazards
  - In-order pipeline, in-order op-fetch with register reservations, in-order issue with scoreboard

- Weaken the timing and flow assumptions: allow later instructions to proceed around ones that are stalled
- Facilitate multiple issue
- Not quite powerful enough to unroll loops dynamically: they stop when an attempt is made to rebind a new value to a register
- Compiler techniques make it easier for HW to find the ILP
  - Reduces the impact of more sophisticated organization
  - Requires a larger architected namespace
  - Easier for more structured code
