HLS C Code Explanation

This page explains some features of the HLS C code implementation currently in the repo (4/5/2013). It covers how to set up the HLS code and where to place HLS optimization directives.

Basics of HLS

The first thing to notice about Vivado HLS is that its IDE is based on Eclipse. Some Eclipse features are left in, so if you are used to Eclipse, more power to you. However, many parts of the tool are specific to HLS.

This wiki page will not go over the Vivado HLS IDE (Xilinx has provided documentation on that already, no need to repeat it).

Function Explanations

This section goes over each of the functions and explains how the code works and how the HLS directives relate to it. Keep in mind that the code as implemented at the moment is not optimized for speed.

miner

The top-level function. In digital design with FPGAs/ASICs, a top level must be defined that contains everything else, somewhat like the main() function in C/C++. Here the SHA256 algorithm is called twice: once on the 80 bytes of data, and a second time on the result of the first.

As of right now, no midstate optimization is applied (precomputing the state after the first 64-byte block, which does not change when only the nonce changes); it will be done in the future.
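The double hash described above can be sketched in plain C. This is a software reference model under stated assumptions, not the repo's HLS code: `compress_ref`, `sha256_ref`, and `miner_sketch` are hypothetical names, and the single pointer-plus-length helper stands in for the repo's separate init/update/final functions (which avoid variable lengths for HLS).

```c
#include <stdint.h>
#include <string.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* SHA-256 round constants. */
const uint32_t K_REF[64] = {
    0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
    0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
    0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
    0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
    0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
    0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
    0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
    0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

/* Compress one 64-byte block into the running state s. */
void compress_ref(uint32_t s[8], const uint8_t blk[64])
{
    uint32_t m[64], a, b, c, d, e, f, g, h, t1, t2;
    int i;

    for (i = 0; i < 16; ++i)   /* pack bytes into big-endian words */
        m[i] = ((uint32_t)blk[4*i] << 24) | ((uint32_t)blk[4*i + 1] << 16) |
               ((uint32_t)blk[4*i + 2] << 8) | (uint32_t)blk[4*i + 3];
    for (; i < 64; ++i)        /* expand the message schedule */
        m[i] = (ROTR(m[i-2], 17) ^ ROTR(m[i-2], 19) ^ (m[i-2] >> 10)) + m[i-7] +
               (ROTR(m[i-15], 7) ^ ROTR(m[i-15], 18) ^ (m[i-15] >> 3)) + m[i-16];

    a = s[0]; b = s[1]; c = s[2]; d = s[3];
    e = s[4]; f = s[5]; g = s[6]; h = s[7];
    for (i = 0; i < 64; ++i) {
        t1 = h + (ROTR(e, 6) ^ ROTR(e, 11) ^ ROTR(e, 25)) +
             ((e & f) ^ (~e & g)) + K_REF[i] + m[i];
        t2 = (ROTR(a, 2) ^ ROTR(a, 13) ^ ROTR(a, 22)) +
             ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    s[0] += a; s[1] += b; s[2] += c; s[3] += d;
    s[4] += e; s[5] += f; s[6] += g; s[7] += h;
}

/* Hash len bytes of data into a 32-byte digest. */
void sha256_ref(const uint8_t *data, uint32_t len, uint8_t out[32])
{
    uint32_t s[8] = { 0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
                      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19 };
    uint8_t blk[64];
    uint64_t bits = (uint64_t)len * 8;
    uint32_t off = 0, rem, i;

    while (len - off >= 64) {           /* full 64-byte blocks */
        compress_ref(s, data + off);
        off += 64;
    }
    rem = len - off;
    memset(blk, 0, sizeof blk);
    memcpy(blk, data + off, rem);
    blk[rem] = 0x80;                    /* padding marker */
    if (rem >= 56) {                    /* no room left for the length field */
        compress_ref(s, blk);
        memset(blk, 0, sizeof blk);
    }
    for (i = 0; i < 8; ++i)             /* 64-bit big-endian bit length */
        blk[63 - i] = (uint8_t)(bits >> (8 * i));
    compress_ref(s, blk);
    for (i = 0; i < 8; ++i) {           /* state words out, most significant byte first */
        out[4*i]     = (uint8_t)(s[i] >> 24);
        out[4*i + 1] = (uint8_t)(s[i] >> 16);
        out[4*i + 2] = (uint8_t)(s[i] >> 8);
        out[4*i + 3] = (uint8_t)s[i];
    }
}

/* Top level: SHA256 over the 80-byte header, then SHA256 over that result. */
void miner_sketch(const uint8_t header[80], uint8_t hash[32])
{
    uint8_t first[32];
    sha256_ref(header, 80, first);
    sha256_ref(first, 32, hash);
}
```

Calling `miner_sketch(header, hash)` with an 80-byte header fills `hash` with the 32-byte double-SHA256 result. A model like this can be handy as a C testbench reference to compare the HLS output against.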

sha256_init

This is the general initialization step of the SHA256 algorithm. It is only called once in the code, but its values should be thought of as ROM at start-up: when the chip turns on, it immediately loads a BRAM as a BROM with the values specified. The contents stay that way until the chip is turned off or loses power.
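A minimal sketch of such an initialization step, assuming a context struct similar to the one the other functions use. The struct layout and the names `sha256_ctx_sketch`, `SHA256_INIT_STATE`, and `sha256_init_sketch` are hypothetical; only the eight initial hash values come from the SHA256 standard.

```c
#include <stdint.h>

/* Hypothetical context layout; the repo's struct may differ. */
typedef struct {
    uint8_t  data[64];    /* current 64-byte block being filled */
    uint32_t datalen;     /* number of bytes currently in data */
    uint32_t bitlen[2];   /* total message length in bits */
    uint32_t state[8];    /* running hash state */
} sha256_ctx_sketch;

/* Standard SHA256 initial hash values. A const table like this never
 * changes at run time, which is what makes it a candidate for ROM. */
static const uint32_t SHA256_INIT_STATE[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

void sha256_init_sketch(sha256_ctx_sketch *ctx)
{
    int i;
    ctx->datalen = 0;
    ctx->bitlen[0] = 0;
    ctx->bitlen[1] = 0;
    for (i = 0; i < 8; ++i)   /* fixed trip count, no dynamic sizes */
        ctx->state[i] = SHA256_INIT_STATE[i];
}
```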

sha256_update_large & sha256_update_small

There are 2 functions here, but both do the same thing. The two functions exist because HLS does not allow dynamic arrays.

Why doesn't HLS allow dynamic arrays? An FPGA design is static: you can't allocate a LUT, FF, or any other component on the chip whenever your system wants to. Now a few will say, "But you can put in a PicoBlaze/MicroBlaze and it'll do it." That is different: you're setting up a CPU and telling it how much BRAM to set aside for that purpose.

Now the functions themselves really do 2 things:

Put the data into the context struct, dubbed ctx.
Hash every 64 bytes received.

The main difference between the two functions is the input length. sha256_update_large takes 80 bytes, and sha256_update_small takes 32 bytes. Ultimately, this means that sha256_update_small never fills a 64-byte block, so it does not compute a hash the way sha256_update_large has to. This provides an opportunity to pre-process the midstate.
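The difference between the two fixed-size variants can be sketched as follows. Everything here is illustrative: `ctx_sketch`, `transform_stub`, and the `_sketch` function names are hypothetical, and the stub only counts calls where the real code would run sha256_transform.

```c
#include <stdint.h>

/* Hypothetical, reduced context: just the block buffer and counters. */
typedef struct {
    uint8_t  data[64];   /* block buffer */
    uint32_t datalen;    /* bytes currently buffered */
    uint32_t blocks;     /* how many 64-byte blocks were "hashed" */
} ctx_sketch;

/* Stand-in for sha256_transform: only records that a block was consumed. */
void transform_stub(ctx_sketch *ctx)
{
    ctx->blocks++;
}

/* 80-byte variant: the loop bound is a compile-time constant. */
void update_large_sketch(ctx_sketch *ctx, const uint8_t data[80])
{
    int i;
    for (i = 0; i < 80; ++i) {
        ctx->data[ctx->datalen++] = data[i];
        if (ctx->datalen == 64) {   /* a full block is ready */
            transform_stub(ctx);
            ctx->datalen = 0;
        }
    }
}

/* 32-byte variant: same body, different constant bound. */
void update_small_sketch(ctx_sketch *ctx, const uint8_t data[32])
{
    int i;
    for (i = 0; i < 32; ++i) {
        ctx->data[ctx->datalen++] = data[i];
        if (ctx->datalen == 64) {
            transform_stub(ctx);
            ctx->datalen = 0;
        }
    }
}
```

Starting from an empty context, the 80-byte variant triggers one transform and leaves 16 bytes buffered, while the 32-byte variant triggers none and leaves all 32 bytes buffered for the final step.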

sha256_transform

This is the meat of the SHA256 function. The following optimizations are utilized:

Dataflow - Optimizes the performance of loops. In more general terms, it attempts a pseudo-pipeline, which means it starts other operations before the loop has finished. An example is provided later.
Unroll - Generally, this parallelizes operations by creating copies of the loop body. It is mostly used with loops, but some operations can use it as well.
Allocation - Limits how many hardware instances of an operation may be used (so operations that would otherwise get 2 units can be forced to share 1). This trades some speed for lower utilization.
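As a rough illustration of where such directives sit in C source, here is a sketch of the message-schedule part of a SHA256 transform with UNROLL and ALLOCATION pragmas. The function name `schedule_sketch`, the loop labels, and the specific pragma options are assumptions for illustration, not the repo's actual settings; a regular C compiler ignores the HLS pragmas.

```c
#include <stdint.h>

#define ROTRIGHT(a, b) (((a) >> (b)) | ((a) << (32 - (b))))
#define SIG0(x) (ROTRIGHT(x, 7) ^ ROTRIGHT(x, 18) ^ ((x) >> 3))
#define SIG1(x) (ROTRIGHT(x, 17) ^ ROTRIGHT(x, 19) ^ ((x) >> 10))

/* Build the 64-word message schedule m from one 64-byte block. */
void schedule_sketch(const uint8_t data[64], uint32_t m[64])
{
/* Assumed setting: allow at most 2 adder instances in this function. */
#pragma HLS ALLOCATION instances=add limit=2 operation
    int i;

PACK_LOOP:
    for (i = 0; i < 16; ++i) {
/* Assumed setting: fully unroll the 16 iterations of this loop. */
#pragma HLS UNROLL
        m[i] = ((uint32_t)data[4 * i] << 24) |
               ((uint32_t)data[4 * i + 1] << 16) |
               ((uint32_t)data[4 * i + 2] << 8) |
               (uint32_t)data[4 * i + 3];
    }

EXPAND_LOOP:
    for (i = 16; i < 64; ++i) {
        /* Each new word depends on four earlier words. */
        m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];
    }
}
```

UNROLL goes inside the loop body it applies to, while ALLOCATION goes at the top of the function (or region) whose operations it limits. Whether a given setting helps has to be checked in the synthesis report.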

With the Dataflow optimization in mind, let's look at a snippet of code:

for (i = 0; i < 64; ++i) {
        t1 = h + EP1(e) + CH(e,f,g) + k[i] + m[i];
        t2 = EP0(a) + MAJ(a,b,c);
        h = g;
        g = f;
        f = e;
        e = d + t1;
        d = c;
        c = b;
        b = a;
        a = t1 + t2;
}

This is the main hashing for loop. As you can tell, t1, t2, e, d, and a are the only variables that depend on each other. The rest are plain copies that do not depend on them. This means b, c, f, g, h can be assigned without losing data integrity. So in one cycle b, c, f, g, h, t1, t2 are written, and in the next cycle a, e, d are written.

I read a research paper explaining that some optimization can be done here because of patterns that show up as the loop progresses. These optimizations will be considered for the future.

sha256_final

The final step in the SHA256 algorithm. Really, this is where the final hash is produced. It has to make sure all remaining data is padded into the proper format for the last transform call. With bitcoin, the input size never changes, so this can usually be processed right away. Once this hashing is done, the byte ordering has to change: this implementation works in little endian while SHA256 is defined in big endian, so we need to change the result back to big endian (the way we got it). Now the hash is done.
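The two jobs of this step can be sketched for the fixed 80-byte case: padding the last block, and writing the state words out most significant byte first. The function names are hypothetical and the code is a plain-C illustration of the standard SHA256 padding rule, not the repo's implementation.

```c
#include <stdint.h>

/* Build the last 64-byte block of an 80-byte message: the 16 leftover
 * data bytes, a 0x80 marker, zeros, then the 64-bit big-endian bit
 * length (80 * 8 = 640 bits). */
void pad_tail_80_sketch(const uint8_t tail[16], uint8_t block[64])
{
    const uint32_t bitlen = 80 * 8;
    int i;

    for (i = 0; i < 16; ++i)
        block[i] = tail[i];
    block[16] = 0x80;
    for (i = 17; i < 60; ++i)      /* zeros, including the high length bytes */
        block[i] = 0;
    block[60] = (uint8_t)(bitlen >> 24);
    block[61] = (uint8_t)(bitlen >> 16);
    block[62] = (uint8_t)(bitlen >> 8);
    block[63] = (uint8_t)bitlen;
}

/* Copy the eight 32-bit state words to the output hash, most
 * significant byte first, independent of the host byte order. */
void state_to_hash_sketch(const uint32_t state[8], uint8_t hash[32])
{
    int i;
    for (i = 0; i < 8; ++i) {
        hash[4 * i]     = (uint8_t)(state[i] >> 24);
        hash[4 * i + 1] = (uint8_t)(state[i] >> 16);
        hash[4 * i + 2] = (uint8_t)(state[i] >> 8);
        hash[4 * i + 3] = (uint8_t)state[i];
    }
}
```

Because the message length is fixed, every loop bound and the length field are constants, which is why this step can usually be processed right away.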

There can be optimization done here, but every time I try to keep it little endian, it doesn't turn out right.

Tips/Tricks

You can set directives in 3 ways: 1) in the C source using #pragma HLS, 2) in the directives.tcl file, 3) via the GUI (which will place the directive in the C source or in directives.tcl).
I recommend mixing the first two. Optimizations like UNROLL and EXPRESSION_BALANCE require you to specify what they apply to; by putting these next to loops and on the top line of a function, you don't need to worry about this. Interface adjustments are better kept in the directives.tcl file.
Don't be discouraged if an optimization doesn't do anything to the overall design
Generally, HLS will try to optimize the best it can, so some optimizations you add may be ones it is already doing. Remember that any optimizations you put in need to be specific in order to see changes.
Synthesizing is cheap, Export to RTL is not
When adjusting optimization statements, synthesize after changes. In this design on a Dell Latitude E6530 laptop, synthesis takes less than 30 seconds. Plus, the report it generates can guide you to where optimizations can take place.
Be aggressive in your clocking.
The best way to optimize the design is to have HLS over-constrain itself with a very aggressive clock speed. This way better optimization can take place, and when you Export to RTL and have it evaluate the design, it starts from a better position in synthesis and implementation.
HDL optimization is always an option
If the design is as optimized as it can get in HLS, you can have Export to RTL evaluate it in the PlanAhead tool, which will create a project and go through XST, MAP, and PAR. From there you can open the project and do your floorplanning and other timing-constraint-related optimization.
Be careful with this: if you make changes here, archive the project to a safe location, because if you make changes in HLS and re-Export to RTL, it will overwrite the project.
Reduce memory (BRAM/LUT) accesses in your code as much as possible
It is easier to meet timing with registers (flip-flops) than LUTs.
Also note that sometimes BRAMs offer better latency (clock cycles) than trying to roll everything in registers.
The use of Labels in C/C++ can be a good thing...
This is not to be confused with label/goto statements (those are evil). Rather, labels name sections of code (HLS refers to these as regions) in which we can hard-code optimization directives.
These labels should be placed around similar operations, and loops. This allows us to "modularize" the code so it can be optimized in sections.
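A small sketch of what labeled loops and regions can look like in C. The function `checksum_sketch`, the label names, and the UNROLL factor are made up for illustration; the labels are ordinary C labels that are never the target of a goto.

```c
#include <stdint.h>

uint32_t checksum_sketch(const uint32_t in[16])
{
    uint32_t acc = 0, mix;
    int i;

/* A labeled loop: directives can refer to it by name. */
SUM_LOOP:
    for (i = 0; i < 16; ++i) {
/* Assumed setting: unroll this loop by a factor of 4. */
#pragma HLS UNROLL factor=4
        acc += in[i];
    }

/* A labeled region: braces group related operations under one name. */
MIX_REGION:
    {
        mix = (acc << 1) ^ (acc >> 1);
    }
    return mix;
}
```

With names like these in place, a directive in directives.tcl can target SUM_LOOP or MIX_REGION inside checksum_sketch instead of the whole function, which is what allows the code to be optimized section by section.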
