WIP: adding energy efficiency discussion
Ravenwater committed Jan 5, 2025
1 parent 9d026dc commit db044c3
Showing 3 changed files with 217 additions and 5 deletions.
73 changes: 73 additions & 0 deletions content/design/energy.md
@@ -0,0 +1,73 @@
+++
prev = "/design/space"
weight = 8
title = "Energy Efficiency"
toc = true
next = "/design/switching-energy"
date = "2025-01-05T13:39:38-05:00"

+++

Table 1 shows switching energy estimates of key computational events by process node.
Data movement operations (reads and writes) have come to dominate energy consumption
in modern processors, which makes a Stored Program Machine (SPM) less and less efficient.
To counter this, CPUs, GPUs, and DSPs alike have added instructions that amortize
instruction-processing overhead over more computation per instruction: they have all
become **SIMD** machines.
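
As a rough illustration of this amortization, consider a simple energy model in which
each instruction carries a fixed fetch/decode/issue overhead that is split across the
vector lanes. This is a minimal sketch: the overhead constant is an assumption, and the
ALU value is taken from the 7/6/5nm column of Table 1.

```python
# Sketch: per-result energy of a SIMD data path, amortizing instruction
# overhead (fetch/decode/issue) over the vector width. The overhead value
# is an illustrative assumption; the ALU value is from Table 1 (7/6/5nm).

INSTRUCTION_OVERHEAD_PJ = 6.0   # assumed per-instruction processing cost
ALU_OP_PJ = 0.030               # 32-bit ALU operation, Table 1, 7/6/5nm

def energy_per_result(simd_width: int) -> float:
    """Amortized instruction overhead plus one ALU op per result."""
    return INSTRUCTION_OVERHEAD_PJ / simd_width + ALU_OP_PJ

for width in (1, 4, 16, 64):
    print(f"SIMD width {width:3d}: {energy_per_result(width):.3f} pJ/result")
```

Wider vectors push the per-result cost toward the bare ALU energy, which is exactly the
amortization described above.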

Fundamentally, the SPM relies on a request/reply protocol to get information from a memory.
Stated differently, the resource contention mechanism deployed by an SPM uses a random
access memory to store inputs, intermediate values, and outputs, and all of this memory
management runs through the request/reply cycle, which is becoming less and less
energy efficient compared to the actual computational event the algorithm requires.
The sequential processing model is steadily losing energy efficiency.

We see that the further the memory is from the ALUs, the worse the energy imbalance becomes.
This has spawned Processor-In-Memory (PIM) and In-Memory-Compute (IMC) structures, where
the processing elements are multiplied and pushed into the memory. This improves the
energy efficiency of the request/reply cycle, but it complicates the data distribution
problem.

Fine-grained data paths, so common in real-time designs, are more energy efficient than
their SPM counterparts because they do not rely on the request/reply cycle. Instead, they
operate in a pipelined fashion, with parallel operational units directly writing results
to the next stage, removing the reliance on random access memory to orchestrate
computational schedules.
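
A back-of-the-envelope comparison makes the difference concrete. The sketch below assumes
the SPM fetches both operands from L1 and writes the result to a register, while a
pipelined stage receives its operands directly from the previous stage and forwards its
result at register-write cost; both cost models are assumptions for illustration, with
values from the 7/6/5nm column of Table 1.

```python
# Sketch: energy for one 32-bit add, SPM-style vs pipelined data path.
# Values from Table 1, 7/6/5nm column (pJ). The two cost models are
# illustrative assumptions, not measured designs.

L1_WORD_READ_PJ = 0.1875   # 32-bit word read from L1
ALU_OP_PJ       = 0.030    # 32-bit ALU operation
REG_WRITE_PJ    = 0.014    # 32-bit register write

# Stored Program Machine: request/reply for both operands, then compute.
spm_add = 2 * L1_WORD_READ_PJ + ALU_OP_PJ + REG_WRITE_PJ

# Pipelined data path: operands arrive from the previous stage; the
# result is latched directly into the next stage's register.
pipelined_add = ALU_OP_PJ + REG_WRITE_PJ

print(f"SPM add:       {spm_add:.4f} pJ")
print(f"Pipelined add: {pipelined_add:.4f} pJ")
print(f"Ratio:         {spm_add / pipelined_add:.1f}x")
```

Under these assumptions the request/reply round trip costs roughly an order of magnitude
more than the pipelined hand-off.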

We have seen the Data Flow Machine (DFM) maintain fine-grain parallelism using a finite
number of processing elements. Unfortunately, the basic operation of the DFM
is less efficient than the basic operation of the SPM. Furthermore, the DFM has no
mechanism to cater to the spatial relationships among a collection of operations:
structured parallelism is treated the same as unstructured parallelism, and incurs
an unnecessary penalty. But the DFM does provide a hint of how to maintain fine-grain
parallelism: its pipeline is a ring, an infinite, yet bounded, structure.

The Domain Flow Architecture (DFA) builds on this observation: it supports and
maintains a local fine-grain spatial structure while offering an infinite computational
fabric with finite resources. DFA is to DFM as PIM is to SPM.

## Values in picojoules (pJ) per operation

| Operation Type | 28/22nm | 16/14/12nm | 7/6/5nm | 3nm | 2nm |
|---------------------------|----------|-----------|----------|---------|---------|
| 32-bit Register Read | 0.040 | 0.025 | 0.012 | 0.008 | 0.006 |
| 32-bit Register Write | 0.045 | 0.028 | 0.014 | 0.009 | 0.007 |
| 32-bit ALU Operation | 0.100 | 0.060 | 0.030 | 0.020 | 0.015 |
| 32-bit FPU Add | 0.400 | 0.250 | 0.120 | 0.080 | 0.060 |
| 32-bit FPU Multiply | 0.800 | 0.500 | 0.250 | 0.170 | 0.130 |
| 32-bit FPU FMA | 1.000 | 0.600 | 0.300 | 0.200 | 0.150 |
| 32-bit Word Read L1 | 0.625 | 0.375 | 0.1875 | 0.125 | 0.09375 |
| 32-bit Word Read L2 | 1.875 | 1.125 | 0.5625 | 0.375 | 0.28125 |
| 32-bit Word Read DDR5 | 6.25 | 5.000 | 3.750 | 3.125 | 2.8125 |
| 64-byte L1 Cache Read | 10.000 | 6.000 | 3.000 | 2.000 | 1.500 |
| 64-byte L2 Cache Read | 30.000 | 18.000 | 9.000 | 6.000 | 4.500 |
| 64-byte DDR5 Memory Read | 100.000 | 80.000 | 60.000 | 50.000 | 45.000 |

Table 1: Switching Energy Estimate by Process Node

**Notes:**
1. 32-bit cache and memory read energies are derived from the 64-byte read energy (one sixteenth of the line value)
2. Smaller process nodes generally reduce switching energy by roughly 40-50% per major node transition
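
Note 1 can be checked directly: a 64-byte line holds sixteen 32-bit words, so each
32-bit entry is one sixteenth of the corresponding line value. A small sketch
reproducing the L1 row of Table 1:

```python
# Check of note 1: 32-bit word read energy is 1/16th of the 64-byte line
# read energy, since 64 bytes = 16 x 32-bit words. Values from Table 1 (pJ).

line_read_l1_pj = {"28/22nm": 10.0, "16/14/12nm": 6.0,
                   "7/6/5nm": 3.0, "3nm": 2.0, "2nm": 1.5}

for node, pj in line_read_l1_pj.items():
    print(f"{node:>10}: 64B L1 read {pj:6.3f} pJ -> 32-bit word {pj / 16:.5f} pJ")
```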



30 changes: 25 additions & 5 deletions content/design/space.md
@@ -19,12 +19,12 @@
space, as it takes time to do so.
What would be the best way to build scalable, parallel execution engines? In 1966,
Michael J. Flynn proposed a taxonomy based on two dimensions, the parallelism of
data and instructions <sup>[1](#flynn)</sup>. A purely sequential machine has a
single instruction stream and a single data stream and the acronym **SISD**. A machine
that applies the same instruction on multiple data elements is a **SIMD** machine,
short for Single Instruction Multiple Data. Machines that apply multiple instruction
streams to a single data element, as used in fault-tolerant and redundant
system designs, carry the designation **MISD**, Multiple Instruction Single Data.
The Multiple Instruction Multiple Data machine, or **MIMD**, consists of many processing
elements simultaneously operating on different data.

The diagram below shows the Flynn Taxonomy <sup>[2](#wikipedia)</sup>:
@@ -40,7 +40,7 @@
_blocks_ of the data structure, and an exchange phase to communicate dependent data between
the nodes.

For these algorithms to work well, the computation phase and the data exchange phase need
to be coordinated such that the program does not need to wait. Distributed Memory Machine (DMM)
algorithms have been studied to categorize them as a function of this constraint. This
categorization has become known as the Seven Dwarfs <sup>[3](#dwarfs)</sup>. Later refinements
have expanded the list to thirteen dwarfs.
@@ -53,7 +53,27 @@
complexity, yielding so-called _weak scaling_ algorithms that can be made efficient
by scaling up the data structure size. Any other information exchange will diminish
the compute efficiency of the distributed machine.
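
To make that constraint concrete, consider one common case: a structured-grid block of
side n, where the per-block computation grows with the volume n³ while the boundary data
exchanged with neighbors grows with the surface 6n². The block shape and cost constants
below are assumptions for illustration only.

```python
# Sketch of weak scaling: per-block compute grows as n^3 (volume) while
# neighbor exchange grows as n^2 (surface), so compute efficiency improves
# as the block size grows. All constants are illustrative assumptions.

def compute_fraction(n: int, t_flop: float = 1.0, t_word: float = 10.0) -> float:
    """Fraction of total work that is computation for an n^3 grid block."""
    compute = t_flop * n**3        # one update per grid point
    exchange = t_word * 6 * n**2   # halo data across six faces
    return compute / (compute + exchange)

for n in (8, 32, 128, 512):
    print(f"block {n:4d}^3: compute fraction {compute_fraction(n):.2%}")
```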

Supercomputers that are purpose-built for capability have been constructed as DMMs since
the early '90s, attesting to the success of the Distributed Memory Machine model.
Yet thirty years of empirical evidence has also shown that designing DMM algorithms is
difficult: the dynamic behavior of the actual execution requires careful design and
benchmarking to maximize resource efficiency.

The **MIMD** approach, however, is not limited to the Distributed Memory Machine.
Real-time data acquisition, signal processing, and control systems also demand parallel
execution, but systems for these use cases tend to be constructed very differently.
Instead of _blocking_ the computation to create subprograms that can be executed
on a Stored Program Machine, real-time systems tend to favor distributed and balanced
data paths designed to never require dynamic reconfiguration.

Whereas Distributed Memory Machines require coarse-grain parallelism to work,
real-time systems tend to favor fine-grain parallelism. Fine-grain parallel systems
offer lower latencies and, an increasingly important benefit, better energy efficiency.
In the next chapter, we'll discuss the techniques used to design spatial mappings
for fine-grained parallel machines.


**Footnotes**

<a name="flynn">1:</a> Flynn, Michael J. (December 1966), [Very high-speed computing systems](https://ieeexplore.ieee.org/document/1447203)

119 changes: 119 additions & 0 deletions content/design/switching-energy.md
@@ -0,0 +1,119 @@
+++
prev = "/design/energy"
weight = 9
title = "Switching Energy Estimates"
toc = true
next = "/design/nextsteps"
date = "2025-01-05T13:39:38-05:00"

+++

This page contains background information regarding the switching energy estimates so
important to designing energy-efficient data paths.

## Register Read/Write Energy Estimates by Process Node
Note: Values are approximate and may vary by foundry and implementation

| Register | 28/22nm (fJ) | 16/14/12nm (fJ) | 7/6/5nm (fJ) | 3nm (fJ) | 2nm (fJ) |
|-----------|--------------|-----------------|--------------|-----------|-----------|
| Read bit | 2.5 - 3.5 | 1.8 - 2.3 | 0.9 - 1.2 | 0.6 - 0.8 | 0.4 - 0.6 |
| Write bit | 3.0 - 4.0 | 2.0 - 2.8 | 1.1 - 1.5 | 0.7 - 1.0 | 0.5 - 0.8 |

**Notes:**
- Values assume typical operating conditions (TT corner, nominal voltage, 25°C)
- Energy includes both dynamic and short-circuit power
- Leakage power not included
- Values are for basic register operations without additional clock tree or routing overhead
- Advanced nodes (3nm, 2nm) are based on early estimates and projections

## Register File Energy Estimates

All values in femtojoules per bit (fJ/bit)

| Operation | Size | 28/22nm | 16/14/12nm | 7/6/5nm | 3nm | 2nm |
|-----------|-----------|-------------|---------------|-------------|-------------|-------------|
| Read | | | | | | |
| | 32-entry | 8.5 - 10.5 | 6.00 - 7.50 | 3.20 - 4.00 | 2.25 - 2.80 | 1.57 - 1.95 |
| | 64-entry | 12.0 - 14.0 | 8.50 - 10.00 | 4.50 - 5.50 | 3.15 - 3.85 | 2.21 - 2.70 |
| | 128-entry | 16.0 - 18.0 | 11.00 - 13.00 | 6.00 - 7.00 | 4.20 - 4.90 | 2.95 - 3.45 |
| Write | | | | | | |
| | 32-entry | 10.0 - 12.0 | 7.00 - 8.50 | 3.80 - 4.60 | 2.65 - 3.25 | 1.85 - 2.28 |
| | 64-entry | 14.0 - 16.0 | 10.00 - 11.50 | 5.20 - 6.20 | 3.65 - 4.35 | 2.55 - 3.05 |
| | 128-entry | 18.0 - 20.0 | 13.00 - 15.00 | 7.00 - 8.00 | 4.90 - 5.60 | 3.45 - 3.95 |

**Notes:**
- All values in femtojoules per bit (fJ/bit)
- Assumes typical operating conditions (TT corner, nominal voltage, 25°C)
- Includes decoder, wordline, and bitline energy
- Includes local clock distribution
- Includes both dynamic and short-circuit power
- Values represent single read port, single write port configuration

## Integer Arithmetic and Logic Unit Switching Energy Estimates

| Unit Type | Bit Size | 28/22nm (pJ) | 16/14/12nm (pJ) | 7/6/5nm (pJ) | 3nm (pJ) | 2nm (pJ) |
|-----------|----------|--------------|-----------------|--------------|----------|----------|
| **CPU ALU** | | | | | | |
| | 8-bit | 0.45 - 0.65 | 0.30 - 0.43 | 0.20 - 0.29 | 0.13 - 0.19 | 0.09 - 0.13 |
| | 16-bit | 0.90 - 1.30 | 0.60 - 0.86 | 0.40 - 0.58 | 0.26 - 0.38 | 0.18 - 0.26 |
| | 24-bit | 1.35 - 1.95 | 0.90 - 1.30 | 0.60 - 0.87 | 0.39 - 0.57 | 0.27 - 0.40 |
| | 32-bit | 1.80 - 2.60 | 1.20 - 1.73 | 0.80 - 1.16 | 0.52 - 0.76 | 0.36 - 0.53 |
| | 40-bit | 2.25 - 3.25 | 1.50 - 2.16 | 1.00 - 1.45 | 0.65 - 0.95 | 0.45 - 0.66 |
| | 48-bit | 2.70 - 3.90 | 1.80 - 2.60 | 1.20 - 1.74 | 0.78 - 1.14 | 0.54 - 0.79 |
| | 56-bit | 3.15 - 4.55 | 2.10 - 3.03 | 1.40 - 2.03 | 0.91 - 1.33 | 0.63 - 0.92 |
| | 64-bit | 3.60 - 5.20 | 2.40 - 3.47 | 1.60 - 2.32 | 1.04 - 1.52 | 0.72 - 1.05 |
| **GPU ALU** | | | | | | |
| | 8-bit | 0.60 - 0.85 | 0.40 - 0.57 | 0.27 - 0.38 | 0.17 - 0.25 | 0.12 - 0.17 |
| | 16-bit | 1.20 - 1.70 | 0.80 - 1.14 | 0.53 - 0.76 | 0.35 - 0.50 | 0.24 - 0.35 |
| | 24-bit | 1.80 - 2.55 | 1.20 - 1.71 | 0.80 - 1.14 | 0.52 - 0.75 | 0.36 - 0.52 |
| | 32-bit | 2.40 - 3.40 | 1.60 - 2.28 | 1.07 - 1.52 | 0.69 - 1.00 | 0.48 - 0.70 |
| | 40-bit | 3.00 - 4.25 | 2.00 - 2.85 | 1.33 - 1.90 | 0.86 - 1.25 | 0.60 - 0.87 |
| | 48-bit | 3.60 - 5.10 | 2.40 - 3.42 | 1.60 - 2.28 | 1.04 - 1.50 | 0.72 - 1.04 |
| | 56-bit | 4.20 - 5.95 | 2.80 - 3.99 | 1.87 - 2.66 | 1.21 - 1.75 | 0.84 - 1.21 |
| | 64-bit | 4.80 - 6.80 | 3.20 - 4.56 | 2.13 - 3.04 | 1.38 - 2.00 | 0.96 - 1.38 |
| **DSP ALU** | | | | | | |
| | 8-bit | 0.55 - 0.75 | 0.37 - 0.53 | 0.25 - 0.35 | 0.16 - 0.23 | 0.11 - 0.16 |
| | 16-bit | 1.10 - 1.50 | 0.73 - 1.00 | 0.49 - 0.70 | 0.32 - 0.46 | 0.22 - 0.32 |
| | 24-bit | 1.65 - 2.25 | 1.10 - 1.50 | 0.73 - 1.05 | 0.48 - 0.69 | 0.33 - 0.48 |
| | 32-bit | 2.20 - 3.00 | 1.47 - 2.00 | 0.98 - 1.40 | 0.63 - 0.92 | 0.44 - 0.64 |
| | 40-bit | 2.75 - 3.75 | 1.83 - 2.50 | 1.22 - 1.75 | 0.79 - 1.15 | 0.55 - 0.80 |
| | 48-bit | 3.30 - 4.50 | 2.20 - 3.00 | 1.47 - 2.10 | 0.95 - 1.38 | 0.66 - 0.96 |
| | 56-bit | 3.85 - 5.25 | 2.57 - 3.50 | 1.71 - 2.45 | 1.11 - 1.61 | 0.77 - 1.12 |
| | 64-bit | 4.40 - 6.00 | 2.93 - 4.00 | 1.96 - 2.80 | 1.27 - 1.84 | 0.88 - 1.28 |

**Notes:**
- Values are approximate switching energy in picojoules (pJ)
- Represents typical dynamic switching energy per operation
- Accounts for:
- Arithmetic data path logic operations
- Typical instruction mix for each design point
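
One structural observation about the table above: every ALU row grows linearly with bit
width from its 8-bit baseline, so the entire table can be regenerated from the per-node
baselines. The sketch below encodes that scaling rule for the 28/22nm column; the
linearity is read off the table itself, not an independent measurement.

```python
# Sketch: the ALU entries above scale linearly with bit width, so each
# entry is (bits / 8) times the 8-bit baseline. Baselines are the 28/22nm
# (low, high) values from the table; the other nodes follow the same rule.

BASELINE_28NM_PJ = {
    "CPU ALU": (0.45, 0.65),
    "GPU ALU": (0.60, 0.85),
    "DSP ALU": (0.55, 0.75),
}

def alu_energy_pj(unit: str, bits: int) -> tuple[float, float]:
    """Linear-in-bit-width estimate of ALU switching energy (low, high)."""
    lo, hi = BASELINE_28NM_PJ[unit]
    return lo * bits / 8, hi * bits / 8

lo, hi = alu_energy_pj("CPU ALU", 32)
print(f"32-bit CPU ALU @ 28/22nm: {lo:.2f} - {hi:.2f} pJ")  # table: 1.80 - 2.60
```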


## Floating-Point Unit Switching Energy Estimates

| Unit Type | Bit Size | 28/22nm (pJ) | 16/14/12nm (pJ) | 7/6/5nm (pJ) | 3nm (pJ) | 2nm (pJ) |
|-----------|----------|--------------|-----------------|--------------|----------|----------|
| **CPU FPU** | | | | | | |
| | 8-bit | 1.20 - 1.70 | 0.80 - 1.14 | 0.53 - 0.76 | 0.35 - 0.50 | 0.24 - 0.35 |
| | 16-bit | 1.80 - 2.60 | 1.20 - 1.73 | 0.80 - 1.16 | 0.52 - 0.76 | 0.36 - 0.53 |
| | 32-bit | 3.60 - 5.20 | 2.40 - 3.47 | 1.60 - 2.32 | 1.04 - 1.52 | 0.72 - 1.05 |
| | 64-bit | 7.20 - 10.40 | 4.80 - 6.93 | 3.20 - 4.64 | 2.08 - 3.04 | 1.44 - 2.10 |
| **GPU FPU** | | | | | | |
| | 8-bit | 1.60 - 2.30 | 1.07 - 1.53 | 0.71 - 1.02 | 0.46 - 0.66 | 0.32 - 0.46 |
| | 16-bit | 2.40 - 3.40 | 1.60 - 2.28 | 1.07 - 1.52 | 0.69 - 1.00 | 0.48 - 0.70 |
| | 32-bit | 4.80 - 6.80 | 3.20 - 4.56 | 2.13 - 3.04 | 1.38 - 2.00 | 0.96 - 1.38 |
| | 64-bit | 9.60 - 13.60 | 6.40 - 9.13 | 4.27 - 6.08 | 2.76 - 4.00 | 1.92 - 2.76 |
| **DSP FPU** | | | | | | |
| | 8-bit | 1.40 - 2.00 | 0.93 - 1.33 | 0.62 - 0.89 | 0.40 - 0.58 | 0.28 - 0.40 |
| | 16-bit | 2.20 - 3.00 | 1.47 - 2.00 | 0.98 - 1.40 | 0.63 - 0.92 | 0.44 - 0.64 |
| | 32-bit | 4.40 - 6.00 | 2.93 - 4.00 | 1.96 - 2.80 | 1.27 - 1.84 | 0.88 - 1.28 |
| | 64-bit | 8.80 - 12.00 | 5.87 - 8.00 | 3.91 - 5.60 | 2.54 - 3.68 | 1.76 - 2.56 |

**Notes:**
- Values are approximate switching energy in picojoules (pJ)
- 8-bit FPU estimates are based on 8-bit floating-point (fp8) formats
- Represents typical dynamic switching energy per operation
- Accounts for:
- Arithmetic logic operations
- Floating-point operations (for FPU)
- Typical instruction mix for each design point
