SGEMM-SASS-Annotation

Background

The project MaxAs is a famous assembler project for NVIDIA Maxwell architecture. It gives a well-known sass demo SGEMM for the matrix multiplication which has a better performance than cuBLAS. However, not only the tutorial ("SGEMM.pdf" in this repo) but also the sass code ("sgemm64_origin.sass" in this repo) are oversimplified. I strongly recommend Chinese readers refer to the blog by XiaoyuWang for a detailed theoretical explanation. As far as I knew, there is no annotated sass code when I write this markdown file.

How-to

The annotated code is in "sgemm64.sass", take some codes as example:

--:-:1:-:1      S2R R119, SR_TID.X;                                 # R119 = threadIdx.x, tid
--:-:2:-:1      S2R R125, SR_CTAID.X;                               # R125 = blockIdx.x, bx
--:-:3:-:1      S2R R122, SR_CTAID.Y;                               # R122 = blockIdx.y, by
01:-:-:-:1      ISETP.GE.AND P0, PT, R119.reuse, 0x20, PT;          # P0 = tid >= 32
--:-:-:-:1      LOP.AND R9, R119.reuse, 0xf;                        # R9 = tid & 0xf, tid15

Words after # are my annotation, annotation R119 = threadIdx.x, tid means the 119th register stores the value of threadIdx.x and it has a alias tid; annotation R9 = tid & 0xf, tid15 means the 9th register stores the result of tid & 0xf and the alias of the 9th register is tid15. Often, the alias is corresponding to the alias in the pseudo-code given by "SGEMM.pdf":

tid = threadId.x;
bx = blockId.x;
by = blockId.y;
blk = tid >= 32 ? by : bx;
ldx = tid >= 32 ? ldb/4 : lda/4;
tex = tid >= 32 ? texB : texA;

Here is another example:

--:-:-:-:1      ISCADD R112, R8, R9, 0x4;                           # R112 = blk << 4 + tid15, _track0

...

--:-:-:-:1      XMAD.MRG R5, R1.reuse, R4.H1.reuse, RZ;             # R5 = ldx * tid2, _track0

...

--:-:-:Y:6      XMAD R112, R1.reuse, R4, R112;                      # R112 = ldx * tid2 + R112, track0

According to the tutorial, variable track0 is calculated by the following operation:

track0 = blk*64/4 + tid15 + (ldx * tid2);

However, it is impossible to calculate track0 in one step. The intermediate variables produced by the intermediate process are all named with an underline in front of the final alias: _track0.

In the example:

--:-:-:-:1      LDS.U.128 R64, [R114];                              # R64 ~ R67 = [readAs+4*(0*64+0) ~ readAs+4*(0*64+0)+3*4]
--:-:-:-:1      LDS.U.128 R72, [R115];                              # R72 ~ R75 = [readBs+4*(0*64+0) ~ readBs+4*(0*64+0)+3*4]
--:-:-:-:1      LDS.U.128 R68, [R114+0x80];                         # R68 ~ R71 = [readAs+4*(0*64+32) ~ readAs+4*(0*64+32)+3*4]
--:-:1:-:1      LDS.U.128 R76, [R115+0x80];                         # R76 ~ R79 = [readBs+4*(01*64+32) ~ readBs+4*(0*64+32)+3*4]

Because the instruction LDS.U.128 will load 4 4-Bytes numbers, there will be 4 registers with the continuous index will be loaded from the shared memory.

About

The code is given by Scott Gray (scott[at]openai.com) in the project MaxAs, annotation is given by Jie Ren (jieren9806[at]gmail.com). Because of my limited knowledge, there would be some mistakes in my annotation, feel free to discuss with me. I will be appreciated for your valuable comments. This project is released under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

SGEMM-SASS-Annotation

Background

How-to

About

Files

README.md

Latest commit

History

README.md

File metadata and controls

SGEMM-SASS-Annotation

Background

How-to

About