Design discussion #38
Comments
No, it is not too low; it's pretty usable, even on a fast-block coin like MBC. In one minute you can visit 6000 nonces, you just have to start with a low difficulty. I remember seeing cryptonight around 30 H/s or so some time ago and it had happy users at XMR. Your idea of maxing out the write bandwidth is interesting. However, it is mandatory that the visited locations cannot be predicted before the area is full, e.g. you read a location and it gives you a pointer. |
I'm currently revisiting this research with new elements. I've looked a bit at how GDDR5 works and, just like DDR4, it suffers from long delays after closing a row, or after having accessed 4 different banks. You can typically expect around 13 ns for the first access, but when you have to add 43 ns to close the row because you stay in the same bank, it means that random reads will still be heavily impacted. Also, I found that a GTX 1080 Ti card has 3584 cores sharing 11 memory channels; that's 325 cores per channel! Thus if we can make the workload heavily dependent on memory access time, the extra cores will only be usable to compensate for the lack of crypto extensions.

Given that a regular CPU fetches a full 64B cache line at once from memory and that GDDR5X also fetches 64B (32B for GDDR5), it makes sense to design a workload relying on very fast operations involving these fetched bytes, so that machines with a small CPU count per memory channel (typically 4) remain memory-bound and not CPU-bound. I also found it interesting to note that high-end GPUs have a small ratio of FP64-capable cores: I'm seeing numbers like a 1:32 ratio on the GTX 1070 and 1:8 on some AMD GPUs. This could be a good way to limit the number of cores that can perform fast operations. In the end I'm interested in involving a lot of slow operations; this way, even if you can optimize a part for certain hardware, you don't save a lot.

Typically I think that I'll pre-fill a work area with 64 MB of data depending on the block and on 24 bits from the nonce. This can take some time because it will be done only once every 256 hashes. On a GPU, if this operation takes as much time as computing 256 hashes, it means that half of the crypto time cannot be parallelized on all cores. Thus one core fills the data, then 256 can use it (or more likely the CPU prepares the data and quickly uploads it). Such writes could make use of FP64, by the way.
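The "workload heavily dependent on memory access time" idea can be sketched as a serial pointer chase over the 64 MB area, where the next 64-byte line to fetch is only known once the current one has been read. This is purely an illustrative sketch, not rfv3 code: the `next_index` derivation and mixing constants are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

#define AREA_BITS 20                    /* 2^20 lines of 64 B = 64 MB */
#define AREA_SIZE (1u << AREA_BITS)

/* One 64-byte record, matching the CPU cache line / GDDR5X fetch unit. */
typedef struct { uint64_t w[8]; } line64;

/* Derive the next index from the freshly fetched line, so the walk cannot
 * be predicted (or prefetched) before the line is actually read.
 * The mixing constant is arbitrary, chosen only for illustration. */
static inline uint32_t next_index(const line64 *l, uint32_t prev)
{
    uint64_t mix = l->w[0] ^ (l->w[3] >> 7)
                 ^ ((uint64_t)prev * 0x9E3779B97F4A7C15ull);
    return (uint32_t)mix & (AREA_SIZE - 1);
}

/* Serial random walk of n steps over the area: each fetch must complete
 * before the address of the next one is known, so throughput is bounded
 * by memory latency rather than by core count. */
uint64_t walk(const line64 *area, uint32_t start, unsigned n)
{
    uint32_t idx = start & (AREA_SIZE - 1);
    uint64_t acc = 0;
    while (n--) {
        const line64 *l = &area[idx];
        acc += l->w[1] ^ l->w[6];       /* cheap mixing of fetched bytes */
        idx = next_index(l, idx);       /* dependency: wait for the read */
    }
    return acc;
}
```

With 325 cores per channel on a 1080 Ti, each extra core running this loop just queues behind the same channel, which is the intended effect.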
This would force any implementation (including FPGA/ASICs) to rely on soft cores and be CPU-bound on the write time. It would be the reads that would be fast, but they would involve serial operations that are fast on small CPUs (crc, div, rot, add, rev) and slower on huge CPUs or GPUs. I was thinking that it would be nice to involve some operations like sha2 in the fast path, but some devices like the RPi sadly still lack such basic extensions (and at this point they're the only ones living in the 20th century), so we would impact such devices too strongly.

I started experimenting with new code and figured it was convenient to reuse some of your existing code from rfv2, thus I'm not sure whether it's better to fork rfv2 into rfv3 (or anything else) or to restart from scratch and just import some code parts and attribute credits. I already managed to cut the files in a more portable way. Just let me know which option you prefer. |
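A sketch of the kind of short serial chain mentioned above (crc, div, rot, add, rev). Nothing here is the final rfv3 design: the constants, the order of operations, and the use of the GCC/Clang `__builtin_bswap64` builtin are all choices made only for this example; each step depends on the previous one, which is what keeps wide cores from parallelising it.

```c
#include <stdint.h>

/* Rotate-left on 64 bits. */
static inline uint64_t rotl64(uint64_t v, unsigned r)
{
    return (v << r) | (v >> ((64 - r) & 63));
}

/* Bitwise CRC32 step over one 32-bit word (reflected 0xEDB88320 poly).
 * A real implementation would use the hardware crc32 instruction where
 * available; the bitwise loop is only for self-containment. */
static inline uint32_t crc32_step(uint32_t crc, uint32_t data)
{
    crc ^= data;
    for (int i = 0; i < 32; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    return crc;
}

/* One serial mixing round over a fetched 64-bit word, chaining the five
 * op classes: crc, add, rot, rev (byte reverse), div. */
uint64_t mix_word(uint64_t acc, uint64_t fetched)
{
    uint32_t c = crc32_step((uint32_t)acc, (uint32_t)fetched);
    c = crc32_step(c, (uint32_t)(fetched >> 32));
    acc = rotl64(acc + fetched, c & 63);       /* add + data-dependent rot */
    acc ^= __builtin_bswap64(acc);             /* rev: byte reverse */
    acc += fetched / (((uint64_t)c & 0xffff) | 1); /* slow integer division */
    return acc + c;
}
```

On an A53/A72 these are all single- or few-cycle ALU operations (plus the division), while a GPU lane or an ASIC gains nothing from extra parallel units.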
Good morning Sir! Nice work, and glad to see you back working on this!

The timing you're talking about is t_FAW; it means "four activate window" and indicates a mandatory delay before issuing a fifth ACTIVATE command. In other words, it's a sliding window and you cannot have more than 4 activate commands in this window, regardless of whether they are for a bank of the same group or not. If you manage to create a purely memory-bound design, it could make a high-end GTX 1080 Ti only 11 times faster than any low-end device such as a Raspberry Pi 3B; that would be really fair in terms of cost, power consumption, and electronic waste.

I didn't notice this 1:X FP64 ratio on GPUs; it's also true that I didn't really consider using floating point in my initial work. With this said, when you have 300 cores per memory channel, you'd still have ~10 FP64 cores per memory channel, so I don't think you will be able to draw any benefit from this limitation.

Regarding the code, I would suggest that you don't embarrass yourself with the existing layout if it does not best suit your needs. Feel free to pick whatever you need there; as you know, I don't really claim anything (and there's even quite a sensible part of this code from you if my memory serves me right). If you find a better split that could help further experimentation, feel free to contribute it, but if it's too much of a hassle, do not spend your valuable time on this. Kind regards, |
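As a back-of-envelope check of why t_FAW matters here: fully random 64-byte reads each need their own row ACTIVATE, and at most 4 ACTIVATEs fit in one t_FAW window, so per-channel random-read bandwidth is bounded by 4×64 B per t_FAW. The t_FAW value below is an assumed order of magnitude for DDR4, not a datasheet figure.

```c
/* Upper bound on fully random read bandwidth per channel when every
 * fetch opens a new row: at most 4 ACTIVATE commands per tFAW window.
 * bytes per nanosecond is numerically the same as GB/s. */
double random_read_cap_gbps(double tfaw_ns, double fetch_bytes)
{
    return 4.0 * fetch_bytes / tfaw_ns;
}
```

For an assumed t_FAW around 32 ns and 64-byte fetches this caps a channel near 8 GB/s, far below its sequential peak, which is exactly the asymmetry a memory-bound design exploits.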
On Wed, Sep 11, 2019 at 11:40:05PM -0700, bschn2 wrote:
> Please have a look there for more detailed information, it's quite up to date
> and well explained: https://www.systemverilog.io/ddr4-basics
> https://www.systemverilog.io/understanding-ddr4-timing-parameters
Thanks a lot for the links, this is exactly what I was looking for! I
could only find information from vendors datasheets, which assume the
reader already knows this well :-)
> If you manage to create a purely memory-bound design, it could make a
> high-end GTX 1080 Ti only 11 times faster than any low-end device such as a
> raspberry pi 3B, that would be really fair in terms of cost, power
> consumption, and electronic waste.
Yes, this is the idea. I still think that GDDR5X is a bit faster
than DDR4, so maybe the card will be 20 times faster. But even 20 times
faster remains OK in my opinion; the break-even would be between 30-50
times an RPi3B. I bought an RPi4B which runs fine at 2 GHz, that's 8 GHz
of A72 for $35; it can constitute an attractive device for many people.
> I didn't notice this 1:X FP64 ratio on GPUs, it's also true that I didn't
> really consider using floating point in my initial work. With this said, when
> you have 300 cores per memory channel, you'd still have ~10 FP64 cores per
> memory channel so I don't think you will be able to draw any benefit from
> this limitation.
In fact I was seeing it differently. I think this will be one way to keep
the cores quite busy, and still make the workload way easier to implement
on CPU/GPU than on FPGA/ASIC. Given that the rounding method is well defined
in C99 and IEC 60559 and is configurable with fesetround(), we should avoid
the issues we faced lately, while making it hard for an ASIC to implement a
compliant stack (adding inverted values while keeping a valid 52-bit mantissa
with no rounding error is likely quite tricky). By the way, I'm fine with
still having a 0.5-bit error when doing no more than 2^20 operations, because
I'd then have 32-bit integers with a 20-bit fractional part that will not
bleed into the integer part unless at least 1 bit of error appears. And
my first tests indicate that the result indeed slightly changes when
adjusting the rounding mode using fesetround().
> Regarding the code, I would suggest that you don't embarrass yourself with
> the existing layout if it does not best suit your needs. Feel free to pick
> whatever you need there, as you know I don't really claim anything (and
> there's even quite a sensible part of this code from you if my memory serves
> me right). If you find a better split that could help further
> experimentation, feel free to contribute it, but if it's too much a hassle,
> do not spend your valuable time on this.
OK, will check what's best then.
Thanks!
Willy
|
Thank you for all the details. I think you have a plan here. |
This is the continuation of the discussion started at the end of #15.
@bschn2 you make a good point indeed. However, what can be done then is to compute a pre-hash of the whole initial message, used as an IV for AES transformations for example, and fill the whole RAM area with this. This guarantees you need to use that much writable memory. Then you can perform some random walks there and mix the data until you find a pattern that works. Then it will be the memory fill time which will be the bottleneck. It's actually nice, as even on the A53, while the L1 cache data path is 64-bit for reads, it's still 128-bit for writes, so there is less of a bottleneck here for small CPUs. Just as an estimate, if we use 32-bit 1600 MT/s DDR3/4, that's 6.4 GB/s theoretical peak, thus 10 ms to fill 64 MB. You'd cap at 100 H/s on a small machine; is this too low?
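The 10 ms and 100 H/s figures follow directly from the numbers assumed above; a small helper makes the arithmetic explicit (one full fill of the area per hash is the assumption of this estimate):

```c
/* Time to fill the work area once, in milliseconds:
 * MB / (GB/s) = MB / (MB/ms) = ms. */
double fill_time_ms(double area_mb, double bw_gb_per_s)
{
    return area_mb / bw_gb_per_s;
}

/* Hash-rate cap when each hash requires one full fill of the area. */
double hashrate_cap_hps(double fill_ms)
{
    return 1000.0 / fill_ms;
}
```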