
Design discussion #38

Open · wtarreau opened this issue May 17, 2019 · 5 comments

@wtarreau commented May 17, 2019

This is the continuation of the discussion started at the end of #15.

@bschn2 you make a good point indeed. However, one option is to compute a pre-hash of the whole initial message, use it for example as an IV for AES transformations, and fill the whole RAM area with the result. This guarantees that this much writable memory is really needed. Then you can perform random walks over the area and mix the data until you find a pattern that works, so that the memory fill time becomes the bottleneck. It's actually convenient, as even on the A53, while the L1 cache data path is 64-bit for reads, it is 128-bit for writes, so small CPUs are less penalized here. Just as an estimate, with 32-bit 1600 MT/s DDR3/4 the theoretical peak is 6.4 GB/s, thus about 10 ms to fill 64 MB. You'd cap at 100 H/s on a small machine; is this too low?
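
A minimal sketch of that fill phase (hypothetical code, not from rfv2; a real version would derive `prehash` from the whole initial message and use AES, whereas mix64() below, splitmix64's finalizer, is only a cheap stand-in):

```c
#include <stddef.h>
#include <stdint.h>

#define AREA_SIZE (64UL << 20)            /* 64 MB work area */

static uint64_t mix64(uint64_t x)         /* stand-in for the AES step */
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* fill the whole area from the pre-hash; purely write-bandwidth bound */
static void fill_area(uint64_t *area, uint64_t prehash)
{
    uint64_t v = prehash;

    for (size_t i = 0; i < AREA_SIZE / 8; i++) {
        v = mix64(v);
        area[i] = v;
    }
}
```

At the 6.4 GB/s theoretical peak above, such a loop is limited by write bandwidth and takes on the order of 10 ms for 64 MB.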

@bschn2 (Owner) commented May 17, 2019

No, it is not too low; it's quite usable, even on a fast-block coin like MBC. At 100 H/s you can visit 6000 nonces per minute; you just have to start with a low difficulty. I remember seeing CryptoNight around 30 H/s or so some time ago, and it had happy users at XMR.

Your idea of maxing out the write bandwidth is interesting. However, it is mandatory that the visited locations cannot be predicted before the area is full, e.g. reading a location gives you a pointer to the next one.

Keep in mind that AES is trivial in hardware and that CRC32 is doable as well, although it takes a lot of area. 64 MB of SRAM is possible (but expensive). You could design an ASIC capable of issuing a 512-bit block every 12 ns and writing to SRAM in parallel this fast; it would only take 87 microseconds to fill this area. So you need to add substantial operations to either the fill or the read of this area. I'm inclined to think that writing very fast first, then performing very expensive lookups, might be a solution, but it could also allow an ASIC to be designed in two parts using different technologies, such as ASIC for writes and FPGA/CPU for reads.

The other option is to measure the time it takes to write a 64B cache line on a target CPU and to insert enough complex operations (rol/ror, bit reversal, 64-bit div, CRC, AES, etc.) to keep the CPU busy during this time. In your example above you have 10 ns available, or 15 cycles on small CPUs to 40 cycles on big ones. It seems possible to place a div, rot, crc and aes there and have them computed for free. Some of them are very problematic for ASICs (div) and are not easily parallelized. Also, using different operations for each word prevents pipelining of hardware units in ASICs. So it could be nice to cut each cache line into eight 64-bit words and perform a different operation on each of them, as in the sketch below.

It would also be possible to exploit slightly unaligned accesses during the writes to make the filling process very hard to parallelize. Imagine the mux/demux dealing with 8-bit granularity out of a 512-bit memory bus! That would require six layers of 256 muxes each, which would be enough to ruin any hope of 12 ns accesses on SRAM.
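
A hypothetical illustration of that per-word idea (not actual rfv2 code; the particular operations and constants are arbitrary examples):

```c
#include <stdint.h>

static inline uint64_t rotl64(uint64_t v, unsigned n)
{
    return (v << n) | (v >> (64 - n));
}

static uint64_t bitrev64(uint64_t v)      /* plain bit reversal */
{
    uint64_t r = 0;

    for (int i = 0; i < 64; i++)
        r = (r << 1) | ((v >> i) & 1);
    return r;
}

/* one 64B cache line = eight 64-bit words, each mixed differently,
 * which defeats pipelining of identical hardware units in an ASIC */
static void mix_line(uint64_t w[8], uint64_t k)
{
    w[0] += k;
    w[1]  = rotl64(w[1], 13);
    w[2] ^= bitrev64(w[2]);
    w[3]  = k / (w[3] | 1);               /* 64-bit div, ASIC-hostile */
    w[4]  = rotl64(w[4] ^ k, 29);
    w[5] -= k;
    w[6]  = ~w[6] + k;
    w[7] ^= w[0];                         /* serial dependency on w[0] */
}
```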

@wtarreau (Author)

I'm currently revisiting this research with new elements. I've looked a bit at how GDDR5 works and, just like DDR4, it suffers from long delays after closing a row, or after having accessed 4 different banks. You can typically expect around 13 ns for the first access, but when you have to add 43 ns to close the row because you stay in the same bank, random reads are still heavily impacted. I also found that a GTX 1080 Ti card has 3584 cores sharing 11 memory channels, i.e. about 325 cores per channel! Thus if we can make the workload heavily dependent on memory access time, the extra cores will only be usable to compensate for the lack of crypto extensions. Given that a regular CPU fetches a full 64B cache line at once from memory, and that GDDR5X also fetches 64B (32B for GDDR5), it makes sense to design a workload relying on very fast operations on these fetched bytes, so that machines with a small CPU count per memory channel (typically 4) remain memory-bound rather than CPU-bound.
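
A sketch of what such a latency-bound read loop could look like (hypothetical code; random_walk() and its rotate/xor mixing are illustrative only):

```c
#include <stddef.h>
#include <stdint.h>

#define AREA_WORDS ((64UL << 20) / 8)     /* 64 MB of 64-bit words */

/* each fetched 64B line determines which line is visited next, so
 * accesses cannot be predicted or batched, and throughput is capped
 * by memory latency rather than by core count */
static uint64_t random_walk(const uint64_t *area, uint64_t h, int loops)
{
    for (int i = 0; i < loops; i++) {
        /* index of the next 64B line depends on the current state */
        size_t line = (size_t)(h % (AREA_WORDS / 8)) * 8;

        /* quick serial work on the 8 words just fetched */
        for (int w = 0; w < 8; w++)
            h = ((h << 7) | (h >> 57)) ^ area[line + w];
    }
    return h;
}
```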

I also found it interesting that high-end GPUs have a small ratio of FP64-capable cores. I'm seeing numbers like a 1:32 ratio on the GTX 1070 and 1:8 on some AMD GPUs. This could be a good way to limit the number of cores that can perform fast operations.

In the end I'm interested in involving a lot of slow operations, so that even if you can optimize one part for certain hardware, you don't save much. Typically I think I'll pre-fill a work area with 64 MB of data depending on the block and on 24 bits from the nonce. This can take some time because it will be done only once every 256 hashes. On a GPU, if this operation takes as much time as computing 256 hashes, then half of the crypto time cannot be parallelized across all cores: one core fills the data, then 256 can use it (or, more likely, the CPU prepares the data and quickly uploads it). Such writes could make use of FP64 by the way. This would force any implementation (including FPGA/ASICs) to rely on soft cores and be CPU-bound on the write time. The reads would be fast but involve serial operations that are fast on small CPUs (crc, div, rot, add, rev) and slower on huge CPUs or GPUs. I was thinking it would be nice to involve operations like sha2 in the fast path, but some devices like the RPi sadly still lack such basic extensions (at this point they're the only ones living in the 20th century), so we would impact such devices too strongly.
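
A rough sketch of that one-fill-per-256-nonces structure (hypothetical code and naming; fill_area() and hash_one() are trivial stand-ins for the real expensive-fill and fast-read primitives):

```c
#include <stddef.h>
#include <stdint.h>

#define AREA_WORDS ((64UL << 20) / 8)     /* 64 MB work area */

static void fill_area(uint64_t *area, uint64_t seed)
{
    for (size_t i = 0; i < AREA_WORDS; i++) {
        seed += 0x9e3779b97f4a7c15ULL;
        area[i] = seed ^ (seed >> 29);
    }
}

static uint64_t hash_one(const uint64_t *area, uint32_t nonce)
{
    uint64_t h = nonce;

    for (int i = 0; i < 64; i++)          /* short random walk */
        h = ((h << 9) | (h >> 55)) ^ area[h % AREA_WORDS];
    return h;
}

/* one fill serves the 256 nonces sharing their top 24 bits */
static void scan_range(uint64_t *area, uint64_t block_prehash,
                       uint32_t nonce_base /* low 8 bits zero */)
{
    fill_area(area, block_prehash ^ (nonce_base >> 8)); /* once per 256 */

    for (uint32_t n = 0; n < 256; n++)
        (void)hash_one(area, nonce_base | n);  /* cheap, reuses the area */
}
```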

I started experimenting with new code and found it convenient to reuse some of your existing code from rfv2, so I'm not sure whether it's better to fork rfv2 into rfv3 (or anything else) or to restart from scratch, import some code parts and attribute credit. I already managed to split the files in a more portable way. Just let me know which option you prefer.

@bschn2 (Owner) commented Sep 12, 2019

Good morning Sir!

Nice work, and glad to see you back working on this! The timing you're talking about is t_FAW; it stands for "four activate window" and indicates a mandatory delay before issuing a fifth activate command. In other words, it's a sliding window in which you cannot have more than 4 activate commands, regardless of whether they target banks of the same group or not.
Please have a look here for more detailed information, it's quite up to date and well explained: https://www.systemverilog.io/ddr4-basics https://www.systemverilog.io/understanding-ddr4-timing-parameters
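
To make the bound concrete (the t_FAW figure below is an assumed, roughly DDR4-class value; the actual number depends on the part, speed grade and page size):

$$\frac{4}{t_{\mathrm{FAW}}} = \frac{4}{30\,\mathrm{ns}} \approx 1.3\times10^{8}\ \text{activates/s},\qquad 1.3\times10^{8}\,\text{/s} \times 64\,\mathrm{B} \approx 8.5\,\mathrm{GB/s}$$

So a random-read pattern that activates a new row for every 64B line would be capped around 8.5 GB/s per rank, well below the interface's peak bandwidth.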

If you manage to create a purely memory-bound design, it could make a high-end GTX 1080 Ti only 11 times faster than a low-end device such as a Raspberry Pi 3B, which would be really fair in terms of cost, power consumption, and electronic waste.

I hadn't noticed this 1:X FP64 ratio on GPUs; it's also true that I didn't really consider using floating point in my initial work. That said, when you have 300 cores per memory channel, you'd still have ~10 FP64 cores per memory channel, so I don't think you will be able to draw much benefit from this limitation.

Regarding the code, I would suggest that you don't burden yourself with the existing layout if it does not best suit your needs. Feel free to pick whatever you need there; as you know, I don't really claim anything (and quite a sizable part of this code is from you, if my memory serves me right). If you find a better split that could help further experimentation, feel free to contribute it, but if it's too much of a hassle, do not spend your valuable time on this.

Kind regards,
B.S.

@wtarreau (Author) commented Sep 12, 2019 via email

@bschn2 (Owner) commented Sep 12, 2019

Thank you for all the details. I think you have a plan here.
