WIP: adding energy efficiency discussion
Ravenwater committed Jan 5, 2025
1 parent 9d026dc commit db044c3
Showing 3 changed files with 217 additions and 5 deletions.
73 changes: 73 additions & 0 deletions content/design/energy.md
@@ -0,0 +1,73 @@
+++
prev = "/design/space"
weight = 8
title = "Energy Efficiency"
toc = true
next = "/design/switching-energy"
date = "2025-01-05T13:39:38-05:00"

+++

Table 1 shows switching energy estimates of key computational events by process node.
Data movement operations (reads and writes) have come to dominate energy consumption
in modern processors, which makes a Stored Program Machine (SPM) less and less efficient.
To counter this, CPUs, GPUs, and DSPs alike have added instructions that amortize
instruction-processing overhead over more computation per instruction: they have all
become **SIMD** machines.
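
As a rough illustration of this amortization, consider a simple energy model in which
each instruction carries a fixed fetch/decode/issue overhead that is split across the
vector lanes. This is a minimal sketch: the overhead constant is an assumption, and the
ALU value is taken from the 7/6/5nm column of Table 1.

```python
# Sketch: per-result energy of a SIMD data path, amortizing instruction
# overhead (fetch/decode/issue) over the vector width. The overhead value
# is an illustrative assumption; the ALU value is from Table 1 (7/6/5nm).

INSTRUCTION_OVERHEAD_PJ = 6.0   # assumed per-instruction processing cost
ALU_OP_PJ = 0.030               # 32-bit ALU operation, Table 1, 7/6/5nm

def energy_per_result(simd_width: int) -> float:
    """Amortized instruction overhead plus one ALU op per result."""
    return INSTRUCTION_OVERHEAD_PJ / simd_width + ALU_OP_PJ

for width in (1, 4, 16, 64):
    print(f"SIMD width {width:3d}: {energy_per_result(width):.3f} pJ/result")
```

Wider vectors push the per-result cost toward the bare ALU energy, which is exactly the
amortization described above.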

Fundamentally, the SPM relies on a request/reply protocol to get information from a memory.
Stated differently, the resource contention mechanism deployed by an SPM uses a random
access memory to store inputs, intermediate values, and outputs, and all of this memory
management runs through the request/reply cycle, which is becoming less and less
energy efficient compared to the actual computational event the algorithm requires.
The sequential processing model is steadily losing energy efficiency.

We see that the further the memory is from the ALUs, the worse the energy imbalance becomes.
This has spawned Processor-In-Memory (PIM) and In-Memory-Compute (IMC) structures, where
the processing elements are multiplied and pushed into the memory. This improves the
energy efficiency of the request/reply cycle, but it complicates the data distribution
problem.

Fine-grained data paths, so common in real-time designs, are more energy efficient than
their SPM counterparts because they do not rely on the request/reply cycle. Instead, they
operate in a pipelined fashion, with parallel operational units directly writing results
to the next stage, removing the reliance on random access memory to orchestrate
computational schedules.
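
A back-of-the-envelope comparison makes the difference concrete. The sketch below assumes
the SPM fetches both operands from L1 and writes the result to a register, while a
pipelined stage receives its operands directly from the previous stage and forwards its
result at register-write cost; both cost models are assumptions for illustration, with
values from the 7/6/5nm column of Table 1.

```python
# Sketch: energy for one 32-bit add, SPM-style vs pipelined data path.
# Values from Table 1, 7/6/5nm column (pJ). The two cost models are
# illustrative assumptions, not measured designs.

L1_WORD_READ_PJ = 0.1875   # 32-bit word read from L1
ALU_OP_PJ       = 0.030    # 32-bit ALU operation
REG_WRITE_PJ    = 0.014    # 32-bit register write

# Stored Program Machine: request/reply for both operands, then compute.
spm_add = 2 * L1_WORD_READ_PJ + ALU_OP_PJ + REG_WRITE_PJ

# Pipelined data path: operands arrive from the previous stage; the
# result is latched directly into the next stage's register.
pipelined_add = ALU_OP_PJ + REG_WRITE_PJ

print(f"SPM add:       {spm_add:.4f} pJ")
print(f"Pipelined add: {pipelined_add:.4f} pJ")
print(f"Ratio:         {spm_add / pipelined_add:.1f}x")
```

Under these assumptions the request/reply round trip costs roughly an order of magnitude
more than the pipelined hand-off.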

We have seen the Data Flow Machine (DFM) maintain fine-grain parallelism using a finite
number of processing elements. Unfortunately, the basic operation of the DFM
is less efficient than the basic operation of the SPM. Furthermore, the DFM has no
mechanism to cater to the spatial relationships among a collection of operations:
structured parallelism is treated the same as unstructured parallelism, and incurs
an unnecessary penalty. But the DFM does provide a hint of how to maintain fine-grain
parallelism: its pipeline is a ring, an infinite, yet bounded, structure.

The Domain Flow Architecture (DFA) builds on this observation: it supports and
maintains a local fine-grain spatial structure while offering an infinite computational
fabric with finite resources. DFA is to DFM as PIM is to SPM.

## Values in picojoules (pJ) per operation

| Operation Type | 28/22nm | 16/14/12nm | 7/6/5nm | 3nm | 2nm |
|---------------------------|----------|-----------|----------|---------|---------|
| 32-bit Register Read | 0.040 | 0.025 | 0.012 | 0.008 | 0.006 |
| 32-bit Register Write | 0.045 | 0.028 | 0.014 | 0.009 | 0.007 |
| 32-bit ALU Operation | 0.100 | 0.060 | 0.030 | 0.020 | 0.015 |
| 32-bit FPU Add | 0.400 | 0.250 | 0.120 | 0.080 | 0.060 |
| 32-bit FPU Multiply | 0.800 | 0.500 | 0.250 | 0.170 | 0.130 |
| 32-bit FPU FMA | 1.000 | 0.600 | 0.300 | 0.200 | 0.150 |
| 32-bit Word Read L1 | 0.625 | 0.375 | 0.1875 | 0.125 | 0.09375 |
| 32-bit Word Read L2 | 1.875 | 1.125 | 0.5625 | 0.375 | 0.28125 |
| 32-bit Word Read DDR5 | 6.25 | 5.000 | 3.750 | 3.125 | 2.8125 |
| 64-byte L1 Cache Read | 10.000 | 6.000 | 3.000 | 2.000 | 1.500 |
| 64-byte L2 Cache Read | 30.000 | 18.000 | 9.000 | 6.000 | 4.500 |
| 64-byte DDR5 Memory Read | 100.000 | 80.000 | 60.000 | 50.000 | 45.000 |

Table 1: Switching Energy Estimate by Process Node

**Notes:**
1. 32-bit cache and memory read energies are derived from the 64-byte read energy (one sixteenth of the line value)
2. Smaller process nodes generally reduce switching energy by roughly 40-50% per major node transition
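
Note 1 can be checked directly: a 64-byte line holds sixteen 32-bit words, so each
32-bit entry is one sixteenth of the corresponding line value. A small sketch
reproducing the L1 row of Table 1:

```python
# Check of note 1: 32-bit word read energy is 1/16th of the 64-byte line
# read energy, since 64 bytes = 16 x 32-bit words. Values from Table 1 (pJ).

line_read_l1_pj = {"28/22nm": 10.0, "16/14/12nm": 6.0,
                   "7/6/5nm": 3.0, "3nm": 2.0, "2nm": 1.5}

for node, pj in line_read_l1_pj.items():
    print(f"{node:>10}: 64B L1 read {pj:6.3f} pJ -> 32-bit word {pj / 16:.5f} pJ")
```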



30 changes: 25 additions & 5 deletions content/design/space.md
@@ -19,12 +19,12 @@
space, as it takes time to do so.
What would be the best way to build scalable, parallel execution engines? In 1966,
Michael J. Flynn proposed a taxonomy based on two dimensions, the parallelism of
data and instructions <sup>[1](#flynn)</sup>. A purely sequential machine has a
single instruction stream and a single data stream and the acronym **SISD**. A machine
that applies the same instruction on multiple data elements is a **SIMD** machine,
short for Single Instruction Multiple Data. Machines that apply multiple instruction
streams to a single data element, as used in fault-tolerant and redundant
system designs, carry the designation **MISD**, Multiple Instruction Single Data.
The Multiple Instruction Multiple Data machine, or **MIMD**, consists of many processing
elements simultaneously operating on different data.

The diagram below shows the Flynn Taxonomy <sup>[2](#wikipedia)</sup>:
@@ -40,7 +40,7 @@
_blocks_ of the data structure, and an exchange phase to communicate dependent data between
the nodes.

For these algorithms to work well, the computation phase and the data exchange phase need
to be coordinated such that the program does not need to wait. Distributed Memory Machine (DMM)
algorithms have been studied to categorize them as a function of this constraint. This
categorization has become known as the Seven Dwarfs <sup>[3](#dwarfs)</sup>. Later refinements
have expanded the list to thirteen dwarfs.
@@ -53,7 +53,27 @@
complexity, yielding so-called _weak scaling_ algorithms that can be made efficient
by scaling up the data structure size. Any other information exchange will diminish
the compute efficiency of the distributed machine.
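
To make that constraint concrete, consider one common case: a structured-grid block of
side n, where the per-block computation grows with the volume n³ while the boundary data
exchanged with neighbors grows with the surface 6n². The block shape and cost constants
below are assumptions for illustration only.

```python
# Sketch of weak scaling: per-block compute grows as n^3 (volume) while
# neighbor exchange grows as n^2 (surface), so compute efficiency improves
# as the block size grows. All constants are illustrative assumptions.

def compute_fraction(n: int, t_flop: float = 1.0, t_word: float = 10.0) -> float:
    """Fraction of total work that is computation for an n^3 grid block."""
    compute = t_flop * n**3        # one update per grid point
    exchange = t_word * 6 * n**2   # halo data across six faces
    return compute / (compute + exchange)

for n in (8, 32, 128, 512):
    print(f"block {n:4d}^3: compute fraction {compute_fraction(n):.2%}")
```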

Supercomputers that are purpose-built for capability have been constructed as DMMs since
the early '90s, attesting to the success of the Distributed Memory Machine model.
Yet thirty years of empirical evidence has also shown that designing DMM algorithms is
difficult: the dynamic behavior of the actual execution requires careful design and
benchmarking to maximize resource efficiency.

The **MIMD** approach, however, is not limited to the Distributed Memory Machine.
Real-time data acquisition, signal processing, and control systems also demand parallel
execution, but systems for these use cases tend to be constructed very differently.
Instead of _blocking_ the computation to create subprograms that can be executed
on a Stored Program Machine, real-time systems tend to favor distributed and balanced
data paths designed to never require dynamic reconfiguration.

Whereas Distributed Memory Machines require coarse-grain parallelism to work,
real-time systems tend to favor fine-grain parallelism. Fine-grain parallel systems
offer lower latencies and, an increasingly important benefit, better energy efficiency.
In the next chapter, we'll discuss the techniques used to design spatial mappings
for fine-grained parallel machines.


**Footnotes**

<a name="flynn">1:</a> Flynn, Michael J. (December 1966), [Very high-speed computing systems](https://ieeexplore.ieee.org/document/1447203)

119 changes: 119 additions & 0 deletions content/design/switching-energy.md
@@ -0,0 +1,119 @@
+++
prev = "/design/energy"
weight = 9
title = "Switching Energy Estimates"
toc = true
next = "/design/nextsteps"
date = "2025-01-05T13:39:38-05:00"

+++

This page contains background information regarding the switching energy estimates so
important to designing energy-efficient data paths.

## Register Read/Write Energy Estimates by Process Node
Note: Values are approximate and may vary by foundry and implementation

| Register | 28/22nm (fJ) | 16/14/12nm (fJ) | 7/6/5nm (fJ) | 3nm (fJ) | 2nm (fJ) |
|-----------|--------------|-----------------|--------------|-----------|-----------|
| Read bit | 2.5 - 3.5 | 1.8 - 2.3 | 0.9 - 1.2 | 0.6 - 0.8 | 0.4 - 0.6 |
| Write bit | 3.0 - 4.0 | 2.0 - 2.8 | 1.1 - 1.5 | 0.7 - 1.0 | 0.5 - 0.8 |

**Notes:**
- Values assume typical operating conditions (TT corner, nominal voltage, 25°C)
- Energy includes both dynamic and short-circuit power
- Leakage power not included
- Values are for basic register operations without additional clock tree or routing overhead
- Advanced nodes (3nm, 2nm) are based on early estimates and projections

## Register File Energy Estimates

All values in femtojoules per bit (fJ/bit)

| Operation | Size | 28/22nm | 16/14/12nm | 7/6/5nm | 3nm | 2nm |
|-----------|-----------|-------------|---------------|-------------|-------------|-------------|
| Read | | | | | | |
| | 32-entry | 8.5 - 10.5 | 6.00 - 7.50 | 3.20 - 4.00 | 2.25 - 2.80 | 1.57 - 1.95 |
| | 64-entry | 12.0 - 14.0 | 8.50 - 10.00 | 4.50 - 5.50 | 3.15 - 3.85 | 2.21 - 2.70 |
| | 128-entry | 16.0 - 18.0 | 11.00 - 13.00 | 6.00 - 7.00 | 4.20 - 4.90 | 2.95 - 3.45 |
| Write | | | | | | |
| | 32-entry | 10.0 - 12.0 | 7.00 - 8.50 | 3.80 - 4.60 | 2.65 - 3.25 | 1.85 - 2.28 |
| | 64-entry | 14.0 - 16.0 | 10.00 - 11.50 | 5.20 - 6.20 | 3.65 - 4.35 | 2.55 - 3.05 |
| | 128-entry | 18.0 - 20.0 | 13.00 - 15.00 | 7.00 - 8.00 | 4.90 - 5.60 | 3.45 - 3.95 |

**Notes:**
- All values in femtojoules per bit (fJ/bit)
- Assumes typical operating conditions (TT corner, nominal voltage, 25°C)
- Includes decoder, wordline, and bitline energy
- Includes local clock distribution
- Includes both dynamic and short-circuit power
- Values represent single read port, single write port configuration

## Integer Arithmetic and Logic Unit Switching Energy Estimates

| Unit Type | Bit Size | 28/22nm (pJ) | 16/14/12nm (pJ) | 7/6/5nm (pJ) | 3nm (pJ) | 2nm (pJ) |
|-----------|----------|--------------|-----------------|--------------|----------|----------|
| **CPU ALU** | | | | | | |
| | 8-bit | 0.45 - 0.65 | 0.30 - 0.43 | 0.20 - 0.29 | 0.13 - 0.19 | 0.09 - 0.13 |
| | 16-bit | 0.90 - 1.30 | 0.60 - 0.86 | 0.40 - 0.58 | 0.26 - 0.38 | 0.18 - 0.26 |
| | 24-bit | 1.35 - 1.95 | 0.90 - 1.30 | 0.60 - 0.87 | 0.39 - 0.57 | 0.27 - 0.40 |
| | 32-bit | 1.80 - 2.60 | 1.20 - 1.73 | 0.80 - 1.16 | 0.52 - 0.76 | 0.36 - 0.53 |
| | 40-bit | 2.25 - 3.25 | 1.50 - 2.16 | 1.00 - 1.45 | 0.65 - 0.95 | 0.45 - 0.66 |
| | 48-bit | 2.70 - 3.90 | 1.80 - 2.60 | 1.20 - 1.74 | 0.78 - 1.14 | 0.54 - 0.79 |
| | 56-bit | 3.15 - 4.55 | 2.10 - 3.03 | 1.40 - 2.03 | 0.91 - 1.33 | 0.63 - 0.92 |
| | 64-bit | 3.60 - 5.20 | 2.40 - 3.47 | 1.60 - 2.32 | 1.04 - 1.52 | 0.72 - 1.05 |
| **GPU ALU** | | | | | | |
| | 8-bit | 0.60 - 0.85 | 0.40 - 0.57 | 0.27 - 0.38 | 0.17 - 0.25 | 0.12 - 0.17 |
| | 16-bit | 1.20 - 1.70 | 0.80 - 1.14 | 0.53 - 0.76 | 0.35 - 0.50 | 0.24 - 0.35 |
| | 24-bit | 1.80 - 2.55 | 1.20 - 1.71 | 0.80 - 1.14 | 0.52 - 0.75 | 0.36 - 0.52 |
| | 32-bit | 2.40 - 3.40 | 1.60 - 2.28 | 1.07 - 1.52 | 0.69 - 1.00 | 0.48 - 0.70 |
| | 40-bit | 3.00 - 4.25 | 2.00 - 2.85 | 1.33 - 1.90 | 0.86 - 1.25 | 0.60 - 0.87 |
| | 48-bit | 3.60 - 5.10 | 2.40 - 3.42 | 1.60 - 2.28 | 1.04 - 1.50 | 0.72 - 1.04 |
| | 56-bit | 4.20 - 5.95 | 2.80 - 3.99 | 1.87 - 2.66 | 1.21 - 1.75 | 0.84 - 1.21 |
| | 64-bit | 4.80 - 6.80 | 3.20 - 4.56 | 2.13 - 3.04 | 1.38 - 2.00 | 0.96 - 1.38 |
| **DSP ALU** | | | | | | |
| | 8-bit | 0.55 - 0.75 | 0.37 - 0.53 | 0.25 - 0.35 | 0.16 - 0.23 | 0.11 - 0.16 |
| | 16-bit | 1.10 - 1.50 | 0.73 - 1.00 | 0.49 - 0.70 | 0.32 - 0.46 | 0.22 - 0.32 |
| | 24-bit | 1.65 - 2.25 | 1.10 - 1.50 | 0.73 - 1.05 | 0.48 - 0.69 | 0.33 - 0.48 |
| | 32-bit | 2.20 - 3.00 | 1.47 - 2.00 | 0.98 - 1.40 | 0.63 - 0.92 | 0.44 - 0.64 |
| | 40-bit | 2.75 - 3.75 | 1.83 - 2.50 | 1.22 - 1.75 | 0.79 - 1.15 | 0.55 - 0.80 |
| | 48-bit | 3.30 - 4.50 | 2.20 - 3.00 | 1.47 - 2.10 | 0.95 - 1.38 | 0.66 - 0.96 |
| | 56-bit | 3.85 - 5.25 | 2.57 - 3.50 | 1.71 - 2.45 | 1.11 - 1.61 | 0.77 - 1.12 |
| | 64-bit | 4.40 - 6.00 | 2.93 - 4.00 | 1.96 - 2.80 | 1.27 - 1.84 | 0.88 - 1.28 |

**Notes:**
- Values are approximate switching energy in picojoules (pJ)
- Represents typical dynamic switching energy per operation
- Accounts for:
- Arithmetic data path logic operations
- Typical instruction mix for each design point
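
One structural observation about the table above: every ALU row grows linearly with bit
width from its 8-bit baseline, so the entire table can be regenerated from the per-node
baselines. The sketch below encodes that scaling rule for the 28/22nm column; the
linearity is read off the table itself, not an independent measurement.

```python
# Sketch: the ALU entries above scale linearly with bit width, so each
# entry is (bits / 8) times the 8-bit baseline. Baselines are the 28/22nm
# (low, high) values from the table; the other nodes follow the same rule.

BASELINE_28NM_PJ = {
    "CPU ALU": (0.45, 0.65),
    "GPU ALU": (0.60, 0.85),
    "DSP ALU": (0.55, 0.75),
}

def alu_energy_pj(unit: str, bits: int) -> tuple[float, float]:
    """Linear-in-bit-width estimate of ALU switching energy (low, high)."""
    lo, hi = BASELINE_28NM_PJ[unit]
    return lo * bits / 8, hi * bits / 8

lo, hi = alu_energy_pj("CPU ALU", 32)
print(f"32-bit CPU ALU @ 28/22nm: {lo:.2f} - {hi:.2f} pJ")  # table: 1.80 - 2.60
```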


## Floating-Point Unit Switching Energy Estimates

| Unit Type | Bit Size | 28/22nm (pJ) | 16/14/12nm (pJ) | 7/6/5nm (pJ) | 3nm (pJ) | 2nm (pJ) |
|-----------|----------|--------------|-----------------|--------------|----------|----------|
| **CPU FPU** | | | | | | |
| | 8-bit | 1.20 - 1.70 | 0.80 - 1.14 | 0.53 - 0.76 | 0.35 - 0.50 | 0.24 - 0.35 |
| | 16-bit | 1.80 - 2.60 | 1.20 - 1.73 | 0.80 - 1.16 | 0.52 - 0.76 | 0.36 - 0.53 |
| | 32-bit | 3.60 - 5.20 | 2.40 - 3.47 | 1.60 - 2.32 | 1.04 - 1.52 | 0.72 - 1.05 |
| | 64-bit | 7.20 - 10.40 | 4.80 - 6.93 | 3.20 - 4.64 | 2.08 - 3.04 | 1.44 - 2.10 |
| **GPU FPU** | | | | | | |
| | 8-bit | 1.60 - 2.30 | 1.07 - 1.53 | 0.71 - 1.02 | 0.46 - 0.66 | 0.32 - 0.46 |
| | 16-bit | 2.40 - 3.40 | 1.60 - 2.28 | 1.07 - 1.52 | 0.69 - 1.00 | 0.48 - 0.70 |
| | 32-bit | 4.80 - 6.80 | 3.20 - 4.56 | 2.13 - 3.04 | 1.38 - 2.00 | 0.96 - 1.38 |
| | 64-bit | 9.60 - 13.60 | 6.40 - 9.13 | 4.27 - 6.08 | 2.76 - 4.00 | 1.92 - 2.76 |
| **DSP FPU** | | | | | | |
| | 8-bit | 1.40 - 2.00 | 0.93 - 1.33 | 0.62 - 0.89 | 0.40 - 0.58 | 0.28 - 0.40 |
| | 16-bit | 2.20 - 3.00 | 1.47 - 2.00 | 0.98 - 1.40 | 0.63 - 0.92 | 0.44 - 0.64 |
| | 32-bit | 4.40 - 6.00 | 2.93 - 4.00 | 1.96 - 2.80 | 1.27 - 1.84 | 0.88 - 1.28 |
| | 64-bit | 8.80 - 12.00 | 5.87 - 8.00 | 3.91 - 5.60 | 2.54 - 3.68 | 1.76 - 2.56 |

**Notes:**
- Values are approximate switching energy in picojoules (pJ)
- 8-bit FPU estimates are based on 8-bit floating-point (fp8) formats
- Represents typical dynamic switching energy per operation
- Accounts for:
- Arithmetic logic operations
- Floating-point operations (for FPU)
- Typical instruction mix for each design point
