



# REDSOC: Recycling Data Slack in OOO Cores

**Gokul Subramanian Ravi** 

Prof. Mikko Lipasti

University of Wisconsin - Madison







Optimized slack-aware OOO scheduler

DATA SLACK

### **BACKGROUND & CLASSIFICATION**



### Arithmetic Logic Unit





### 16-bit Kogge Stone Adder
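The slide's figure is not reproduced here, so a small software sketch of a Kogge-Stone parallel-prefix adder may help make the structure concrete. This is the textbook bitwise formulation, not the synthesized circuit from the talk; the function name `kogge_stone_add`, its return tuple, and the stage counter are illustrative assumptions.

```python
def kogge_stone_add(a, b, width=16):
    """Add two unsigned integers with a Kogge-Stone prefix network.

    Returns (sum modulo 2**width, carry_out, number_of_prefix_stages).
    All names here are illustrative, not taken from the slides.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    generate = a & b        # bit i creates a carry on its own
    propagate = a ^ b       # bit i passes an incoming carry along
    half_sum = propagate    # kept for the final sum computation

    shift = 1
    stages = 0
    while shift < width:
        # Combine each position with the group ending `shift` bits below it.
        generate = (generate | (propagate & (generate << shift))) & mask
        propagate = (propagate & (propagate << shift)) & mask
        shift <<= 1
        stages += 1

    carries_in = (generate << 1) & mask        # carry into each bit position
    total = (half_sum ^ carries_in) & mask
    carry_out = (generate >> (width - 1)) & 1
    return total, carry_out, stages


print(kogge_stone_add(1234, 4321))        # 16-bit add
print(kogge_stone_add(200, 100, width=8)) # 8-bit add, fewer prefix stages
```

In this model a 16-bit add passes through four prefix stages and an 8-bit add through three, which is one way to picture why narrower data can resolve earlier than the worst-case path.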





VADD.I16 Q0, Q1, Q2<sup>[1]</sup>

Details in paper!

# Identifying data slack at decoder: instruction / prediction / lookup table



**EXECUTE STAGE** 

### TRANSPARENT DATAFLOW

# Transparency on a Dataflow Graph



2/22/19

### Transparent Dataflow with Synchronous Control




INSTRUCTION SCHEDULER

### **SLACK INCORPORATED SCHEDULING**

### Slack-efficient Scheduler: Motivation



# Slack-aware Scheduler: Proposal

Timeline figure: Cycle 1 through Cycle 5.

1. Slack Accumulation: tracking and accumulating slack over dataflow graphs
2. Eager Grandparent Wakeup: speculative child wakeup via grandparent tags [Stark, 2000]
3. Skewed Selection Logic: non-speculative parent preferred over speculative child
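Slack accumulation can be illustrated with a toy timing model. The sketch below is an assumption-laden simplification, not the scheduler described in the talk: it measures each operation's delay in eighths of a cycle (a resolution chosen here for illustration), lets every operation in a dependent chain start as soon as its parent finishes, and counts how many clock cycles the chain spans. The function names and the example delays are hypothetical.

```python
TICKS_PER_CYCLE = 8  # assumption: delays are quantized to eighths of a cycle


def baseline_cycles(delays_in_ticks):
    """Synchronous baseline: each single-cycle op occupies one full cycle."""
    return len(delays_in_ticks)


def transparent_cycles(delays_in_ticks):
    """Toy transparent-dataflow model for one dependent chain.

    Each op begins the moment its parent's result is ready, so the unused
    tail of every cycle (the slack) accumulates along the chain.
    """
    finish_tick = 0
    for delay in delays_in_ticks:
        if not 0 < delay <= TICKS_PER_CYCLE:
            raise ValueError("each op must fit within one cycle")
        finish_tick += delay
    # Round up: the final result is only usable at a clock edge.
    return -(-finish_tick // TICKS_PER_CYCLE)


chain = [5, 4, 6, 5]  # hypothetical per-op delays, in eighths of a cycle
print("baseline cycles:   ", baseline_cycles(chain))
print("transparent cycles:", transparent_cycles(chain))
```

With these made-up delays the chain needs four cycles in the baseline and three in the toy transparent model, because the accumulated slack adds up to more than one full cycle.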


© Gokul Ravi

# Scheduling Microarchitecture (Illustrative)



### **Scheduling Microarchitecture (Operational)**



**EVALUATION** 

### **METHODOLOGY AND RESULTS**

# **Experimental Setup**

#### Methodology:

- Perf & Power: Gem5 + McPAT
- Slack analysis: Synthesis on Synopsys Design Compiler
- Freq & Tech: 2 GHz, 45nm

#### Baseline:

- 3 sizes of OOO cores
  - Front-end width: 3/4/8
  - Execution (ALUs): 3/4/6
  - 64 kB L1 I/D caches, 2 MB L2

#### Benchmarks:

Compute intensive benchmarks from SPEC CPU2006 / MiBench / ARM Compute Library (ARM ISA)

# **Benchmark Operation Distribution**



# Speedup over different cores



# Comparison with other proposals



#### Timing Speculation

- Increase frequency at fixed voltage, with timing errors
- Error detection, coarse tuning granularity, potential error every operation

#### Operation Fusion

- Pairs of operations in single (standard) clock cycle
- Low opportunity, costly compiler or h/w optimization

# **REDSOC** Conclusions

#### I hope I convinced you that

- Data Slack is considerable
- Transparent dataflow designs are attractive
- OOO modifiable with reasonable overheads

#### Advantages

- No timing errors or detection
- Instruction granularity control
- Traditional cores/apps/compilers

#### Future work:

- Slack between FUs
- Approximate
- Other parts of core
- Other processing designs

#### Results

- 5 to 25% performance improvement
- 2x to 6x more efficient than prior proposals




# Backup slides

# **Prior Proposals**



| Prior Work                            | Description                                             | Limitations                                                              |
|---------------------------------------|---------------------------------------------------------|--------------------------------------------------------------------------|
| Elastic Pipelines [Nowick, 2011]      | Asynchronous blocks w/ handshake mechanisms             | Completion detection / handshake overheads; high sync. integration costs |
| Specialized Datapaths [Sampson, 2011] | Single 'slow' cycle executing chained combinational ops | Poor flexibility, low throughput or replication overheads                |
| Operation Fusion [Park, 2009]         | Sequence of operations in single (standard) clock cycle | Low opportunity, costly compiler or h/w optimization                     |
| Synchronous Timing Speculation        | Increase/decrease frequency/voltage, with timing errors | Error detection/recovery, high tuning overhead, potential error          |

Need for aggressive solutions w/ low (or no) risk, suited to general purpose compute!!

# **Processor Configurations**

| Parameter       | Small | Medium | Big   |
|-----------------|-------|--------|-------|
| Frequency       | 2 GHz | 2 GHz  | 2 GHz |
| Front-End Width | 3     | 4      | 8     |
| ROB Size        | 40    | 80     | 160   |
| LSQ Size        | 16    | 32     | 64    |
| RSEs            | 32    | 64     | 128   |
| ALUs            | 3     | 4      | 6     |
| L1 I/D Cache    | 64 kB | 64 kB  | 64 kB |
| L2 Cache        | 2 MB  | 2 MB   | 2 MB  |

# Low-precision GEMM library





~50% of operations show 25-50% data slack!!

### **PVT Slack**



#### Process:

- Manufacturing variability
- $V_{th}$, $L_{gate}$

#### Voltage:

- Current fluctuations
- Workload activity

#### Temperature:

- Hotspots
- Electron collisions

### Prior Proposal #1: Timing Speculation

Increase frequency OR reduce voltage allowing *some* timing errors to occur.



- Requires costly timing-error detection and recovery mechanisms.
- Only allows coarse-grained control, hence speculation is conservative (to keep the error rate low).

### Prior Proposal #2: Specialized Data Paths

Multi-cycle data path with a sequence of combinational events executed in one "slow tick".



- Poor throughput or significant replication overheads
- No flexibility for general purpose processing

### Prior Proposal #3: Operation Fusion

Squeeze a sequence of operations into a single (standard) clock cycle



- Low opportunity in un-optimized code
- Costly compiler or hardware optimizations to attempt significant operation reordering

# Comparison with other proposals





# Data Slack Classification

LP-GEMM

- Operation Slack:
  - Encoded within instruction
  - Obtained via decode
- Data-Type Slack:
  - Encoded within instruction
  - Obtained via decode
- Data-Width Slack:
  - From operands (too late)
  - Predict at decode<sup>[Loh, 2002]</sup>
  - Verify at execute


# Skewed Select logic





# **Overheads**

- Decode:
  - Width predictor uses 1.5 KB of state
  - Area/energy overhead: 0.5% of core
- Execute:
  - Negligible
- Scheduler:
  - Slack computations are 3 bits wide
  - Operational design adds 10 extra bits per RSE
  - Area/energy overhead: 0.3%/0.8%
  - Increase in scheduler delay is 1.5% (pessimistic)
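As a quick worked example, the per-RSE figure above can be combined with the RSE counts from the Processor Configurations table to estimate the extra scheduler state per core size. The helper name `scheduler_extra_bits` is illustrative, and the calculation covers only the 10 bits per RSE stated above, nothing else.

```python
EXTRA_BITS_PER_RSE = 10  # from the scheduler overhead bullet above

# RSE counts from the Processor Configurations table
RSE_COUNT = {"Small": 32, "Medium": 64, "Big": 128}


def scheduler_extra_bits(rse_count, bits_per_rse=EXTRA_BITS_PER_RSE):
    """Total extra scheduler state, in bits, for a given number of RSEs."""
    return rse_count * bits_per_rse


for core, count in RSE_COUNT.items():
    bits = scheduler_extra_bits(count)
    print(f"{core}: {bits} bits ({bits // 8} bytes)")
```

This works out to 320, 640 and 1280 bits (40, 80 and 160 bytes) for the small, medium and big cores respectively.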

# Timing Closure in Execute

In a standard flip-flop (FF) design, the timing paths to analyze for timing closure would be (F1i–F1o), (F2i–F2o), (F1o–F2o) and (F2o–F1o).

For the transparent design, these would be (F1i–F2o) and (F2o–F2o) when M12 is enabled for transparent dataflow. Similarly, they would be (F2i–F1o) and (F1o–F1o) when M21 is enabled for transparent dataflow.