RFv2 speed #15
Strange, those 30 H/s are 40 times slower than my i7 (mine does 1200 H/s). Did you build with `-march=native -O3`? |
My old ATOM D510 does 55 H/s :-) |
@wtarreau oh, I guess there is a little misunderstanding. It gives me ~30 H/s per thread using the cpuminer source code example from the rfv2 branch. I just tried recompiling the test example with `-march=native -O3` and now it gives me ~130 H/s per thread :D |
Interesting, what is the purpose of this `-march=native -O3` option? |
On Thu, Apr 04, 2019 at 11:48:04PM -0700, iamstenman wrote:
@wtarreau oh, I guess there is a little misunderstanding. It gives me ~30 H/s
per thread using the cpuminer source code example from the rfv2 branch. I just
tried recompiling the test example with `-march=native -O3` and now it gives
me ~130 H/s per thread :D
Much better!
… Interesting, what is the purpose of this `-march=native -O3` option?
-O3 is the optimisation level; it's the highest on most compilers.
Recently gcc introduced -Ofast, which is roughly -O3 plus a few
non-standard optimisations such as fast math.
-march=native indicates that certain features present on the CPU
you're building on will be enabled. This is recommended when you
don't want to manually enable/disable each and every feature and
you know you're building to run on your local machine.
You can check the output of these :
$ gcc -dM -xc -E - < /dev/null | less
$ gcc -march=native -dM -xc -E - < /dev/null | less
-dM dumps all known macros
-xc says the source is a C file
-E says just emit the preprocessed output
So you end up with a lot of defines.
For example you'll find some AES, SSE, AVX and whatever is available
on *this* CPU that are not available on another one. This is why my
server's Atom was previously dying with "illegal instruction" when I
tried to run the executable built for the i7 :-)
Willy
|
Hello again @wtarreau. After some tests (~20 minutes mining on testnet at the lowest difficulty without finding any blocks) I still think that RFv2 in its current state is way too slow :( |
On Fri, Apr 05, 2019 at 06:04:18PM +0000, iamstenman wrote:
Hello again @wtarreau. After some tests (~20 minutes mining on testnet at
the lowest difficulty without finding any blocks) I still think that RFv2
in its current state is way too slow :(
It's very possible, I don't know much about what all this implies in fact :-)
Probably lowering the number of rounds would help a lot, though
maybe some other parts of the algo are very expensive as well. What
would be a correct hashrate in your opinion? Out of pure curiosity
I'm interested in checking the impact of various knobs on various
devices (arm, x86).
Willy
|
For example, I just checked and merely changing this multiplies the perf
by 3:
- if (__builtin_clrsbl(old) > 5) {
+ if (__builtin_clrsbl(old) > 4) {
(...)
- loops = sin_scaled(msgh) * 3;
+ loops = sin_scaled(msgh) + 1;
Maybe that's enough in your case ?
Willy
|
On Fri, Apr 05, 2019 at 09:34:08PM +0200, Willy Tarreau wrote:
For example, I just checked and merely changing this multiplies the perf
by 3:
- if (__builtin_clrsbl(old) > 5) {
+ if (__builtin_clrsbl(old) > 4) {
(...)
- loops = sin_scaled(msgh) * 3;
+ loops = sin_scaled(msgh) + 1;
Maybe that's enough in your case ?
I can even go further by removing two sets of divbox+scramble in
the round. My i7 reaches 4500 H/s there.
Willy
|
And I can reach 6k hashes/s on the i7 by flattening the curve giving the
number of loops to reduce the extremes :
static uint8_t sin_scaled(unsigned int x)
{
- return pow(sin(x / 16.0), 5) * 127.0 + 128.0;
+ return sqrt(sqrt(pow(sin(x / 16.0), 5) + 1.0)) * 100.0;
}
(rpi does 3900 here and npi 10300).
Willy
|
I was never a fan of this difficulty target calculation idea. The whole Bitcoin difficulty target calculation is a second-class solution and does not really work. Imagine the hashrate goes very high and the difficulty increases to its maximum. It is highly questionable whether it will even be possible to produce the needed hashes for MicroBitcoin, as 21 trillion coins need to be mined instead of the 21 million coins Bitcoin has. We may very well run into problems with not having enough hashes to be mined in MicroBitcoin since trillions of coins are planned. Because of this, a lot of wasted hash calculations are produced and energy is wasted. Reducing the loops, as I see it, benefits big power-hungry CPUs over small power-efficient CPUs and goes against the idea of the RainForest hash algorithm. I suggest going down the path that EquiHash uses, as implemented for example in Bitcoin Gold or ZCash: instead of finding a hash that is below a target, I really recommend dropping this whole target calculation. If my memory serves me right, EquiHash creates one RamBox and then looks for hash collisions inside it. Because of this the EquiHash block header has 80 bytes plus the 32-byte collision hash. Maybe something similar can also be done with the RainForest hash algorithm? |
Gentlemen, please be careful if you start to change the number of loops, iterations per round, or the loop curve! @wtarreau, at least make sure you round the pow up by adding 1.5 and not 1.0. This aside, your suggestion looks reasonable to me as it gives more of a body-like shape which stays longer in the higher range and hence reduces the peaks. In any case you must check the average number of history buffer entries and double it in the define (well, maybe less than double now with the curve change, but it must be at least 20-25% larger to avoid repopulating the rambox from scratch at the end). If the number of entries is lower than 512 then you need to decrease the clrsbl threshold so that it writes more often. The divbox+scramble calls you suggested removing were indeed there only to balance the power between low-end CPUs and high-end ones. There are exactly 3 which are doubled and could safely be reduced to 1 each (the original smhasher tests were run with both configurations). And yes, please keep an eye on raspberry-pi type devices, as it seems essential to me that such machines be about as fast as regular PCs if we really want to incentivize energy savings. Your numbers look fine to me as I initially targeted 1k to 10k H/s/core, so we're pretty much in this area. |
And how much faster is it on large devices versus small devices for equihash? Are you sure we don't favour only the large ones there? |
Also the hash verification time counts a lot. I think that rainforest is great for this: it costs, but not way too much. |
Dear @LinuXperia, rfv2's rambox is not far from what you describe, since the rambox is modified by every lookup based on the hashed message (thus it includes the nonce). However, keep in mind that Salsa20 was designed to be extremely fast on x86 processors (typically less than 4 cycles per byte), and that this can hardly be considered fair for emerging countries where such hardware simply is not available; all people have is a previous-generation smartphone to do everything. There, the power often comes from local solar panels, so maintaining a decent capacity on such devices is very important for the overall fossil energy consumption. |
@bschn2 good catch for the 1.5, I'll try this to stay safe. Thanks for confirming the divbox calls that can be reduced. Regarding the write ratio, if you look you'll notice I already adjusted it but, granted, I didn't check the values. I'll do so and will prepare a patch with all this soon. |
In discussion #15, @wtarreau experimented with two sqrt() around the sine to pack values better. This experiment proved to provide smoother oscillations and can be simplified by reducing the exponentiation by two. Also the extra addition is simplified with a call to round(). With this we can safely divide the number of rounds by almost 3 and see the hash rate increase by as much. This requires adjusting the memory access ratios however (4 times more), which overall increases both performance and security.
@bschn2 here is my easy-to-implement improvement suggestion. The problem is the part that requires us to check for a specific rare hash with leading zeros. My approach, which solves this without changing a lot of the code, is that instead of looking for a rare end hash with leading zeros that is under a target hash, we use the number of leading zeros from the nBits field as the number of leading bytes of the calculated nonce end hash to be matched in the rambox. Let's say the lowest difficulty requires a hash with two leading zeros. The value of how many leading zeros are required is stored in the 80-byte pdata block header that each thread has, so implementing this is very easy and should work like a charm. If somebody finds such a byte end-hash collision very fast, the difficulty automatically adjusts the nBits field and requires a hash with, say, 8 leading zeros, which is harder than before. Again, instead of looking for an end hash with 8 leading zeros, with rainforest we just match 8 bytes of each nonce end hash and look in the rambox to see whether such a byte combination exists. If yes, and this was again found faster than the 1-minute block time that MicroBitcoin requires, the difficulty algorithm will adjust the nBits field to require, say, 16 leading zeros, which for us and the rainforest algorithm means we need to match 16 bytes of each calculated nonce end hash in the rambox. If finding such a 16-byte combination now takes 2 minutes instead of the required 1 minute, then the difficulty algorithm will drop the difficulty in the nBits field to 14 leading zeros, i.e. 14 bytes in the rambox, making it easier than before. This way everything adjusts automatically so that we stay inside the 1-minute time frame for mining a block, without having to brute-force hash calculations to find a rare hash with leading zeros. |
Hello again @bschn2 @wtarreau. Line 726 in 3b35a37
And here is what I got:
Maybe this test would be helpful in some way :) |
@itwysgsl do you mean you don't find shares, or you don't find blocks? Not finding shares is indeed problematic, but it should simply be a matter of difficulty. Indeed, at 1.11 kH/s you scan a full 16-bit range every 60 seconds, so if the pool's difficulty is low enough you must find these shares. What is your difficulty in this case? If what you don't find is a block, this sounds normal, as the purpose is that the chances of finding a block are equally shared among miners: if you have 1000 miners, one will mine the block while the other 999 will not. So if a new block is emitted every minute, each of 1000 miners would on average find a block every 1000 minutes. But again, even then it's a matter of adjusting the difficulty: if the target is 0x0000FFFF...FFFF then at 1 kH/s you will find it in about 60 seconds. Last point, I'm a bit surprised by your i7's performance here, did you enable the correct build options? This is roughly 5 times slower than mine before dividing the rounds, hence 15 times slower overall. Did you enable -O3 and -march=native? |
Oh and by the way, many thanks for sharing your observations! |
@LinuXperia I'm still unsure I really understand the principle you're describing. I think it is very similar to hashing except that you look up some bits in the rambox. What I don't understand is how you populate it and how you validate a share or a block afterwards. Also in any case the computation time spent is required as a proof of work. Whether you find the bits in the rambox or anywhere else inside the hash algorithm, it's the same, you have to iterate over nonces so that most participants find shares to be paid and that one of them finds the block. So it's unclear to me what your method brings at this point. |
@itwysgsl I am also surprised by your numbers, how do you test? Is it with the patched cpuminer maybe? Have you tried "rfv2_test -b -t $(nproc)"? I must confess I have not checked how or when it initializes the rambox. I hope it does it only once on the first call, but I don't know. This could explain your low performance if it rebuilds a full rambox for each hash. |
@bschn2 Because of this problem, improvements are needed so that a single miner using a single-board computer running the MicroBitcoin rainforest miner is able to mine a block solo in about 60 seconds. So the minimal requirement of one miner mining one block on one blockchain node in the given period failed. His hash numbers look okay, as he gets the same hash speed on his i7 as I do on mine. Because of this I suggested abandoning the Bitcoin approach of looking for a rare hash with leading zeros and instead using the EquiHash approach of finding bit collisions. Finding bit collisions makes it easier to mine blocks, as we don't need to brute-force a huge number of hashes and lose time until we find such a rare hash. |
@LinuXperia how do you measure this performance, and how is this lowest difficulty calculated or configured (sorry I'm not much aware of all this, I'm only using cpuminer to validate the thermal robustness of my build farm). With rfv2_test I'm seeing numbers 8 times larger than yours: |
@LinuXperia well, I really don't understand the method you're trying to explain, I'm sorry. I don't understand why you say "rare hash with leading zeroes", the number of zeroes is log2(1/frequency) so if a matching hash is rare it's because it has been made so by the difficulty. |
Well after having read a bit about equihash I think I get it a bit better, but in my humble opinion it focuses solely on the memory-bound aspect and as a result it has already been ported to an ASIC (bitmain's Z9 which is 10 times faster for the price than a GPU) : https://www.heise.de/newsticker/meldung/Ende-der-Grafikkarten-Aera-8000-ASIC-Miner-fuer-Zcash-Bitcoin-Gold-Co-4091821.html Also looking at the numbers, it's said that an Nvidia 1080Ti does only 650 sol/s (=hashes/s) so it's even way lower than what we're doing on rfv2. The main challenge we have to address is to make sure that MBC's short-lived blocks can be mined in the block's life, and the solution above apparently makes this situation worse from what I'm reading. |
Hello y'all, I've been following RF since last year and implemented V1 on my unreleased coin. |
Interesting. It's important to keep in mind that memory speed varies with the device's price. The DRAM access times I've measured so far: http://git.1wt.eu/web?p=ramspeed.git;a=blob;f=data/results.txt |
So this is what I was afraid of, equihash mostly targets high-performance hardware. I doubt I can run it on my Raspberry Pi! |
|
Still late on my work, I'm afraid it's not for today anymore. Wanted to let you know. |
Btw, yesterday @jareso (author of an optimized private miner for rfv2) showed up in our discord and left this message: right now his miner, on average, produces 18 MH/s per miner. |
Also I tested RFv2 with the recent patches on my android phone, and it produces around 10 kH/s. |
@itwysgsl his response is totally valid regarding the fact that others are attacking algorithms as well. What I'm contesting is not this. It's the fact that we're all working on a way to better decentralize mining, and that this person has the required skills to spot weaknesses before the algorithm is released, yet silently keeps them secret, incurring lots of work for you and the pools upon each release, while disclosing such issues early would result in a more robust initial design. Thus I stand by my words: his primary motive is to grab most of the shares, not to help with fairness. With this said, I really doubt he achieves 18 MH/s per card; I suspect there are quite a number of cards per miner. The algorithm involves operations that are extremely fast on ARM, very fast on x86, and not natively implemented on many other platforms, thus costing more there, so the base performance per core will be lower. Anyway, this helps fuel my ideas for a v3 :) I'm seeing on miningpoolstats that our latest update has done a lot of good, with most miners going back to the public pools. Also there are mostly GPUs on zergpool and mostly CPUs on skypool. So far, so good. Regarding your phone, it's ARMv7, right? If so it's pretty decent for such a platform, which doesn't have AES, IDIV, CRC nor 64-bit native operations! Let's wait for some feedback from users of more recent phones using ARMv8 (with CPUs like Snapdragon or Kryo). |
@bschn2 the CPU vs GPU specialization of pools is natural, expected and nothing new: if you have low power (say a CPU) you'd rather join a pool where users are mostly like you, so that you have a chance to get a respectable share every time a block's found, even if it's not frequent. But if you have 40 GPUs and can expect to help a pool mine a block 3 times more frequently and get 50% of the shares, you'd rather join a pool already showing a high hash rate because you'll get a big share frequently. And if you know you're concentrating most of the power, you'd rather join the pool with the lowest fees or run solo. And with few miners at the moment, MBC is extremely appealing to mine: even with moderate power you can expect to make a good share of 12.5k every 2 minutes or so. I watched yesterday and estimated that with only 1 MH/s it was possible to make roughly $25/day. This is why those with large enough power immediately jump onto such coins, they are extremely profitable at opening. Once more miners join, the revenue is spread more evenly and such high revenues can no longer be expected. At the moment it seems to go mostly to Jareso though. @bschn2: most android phones with armv8 seem to enable armv7a only: https://i.stack.imgur.com/7EF24.jpg https://i.stack.imgur.com/XCGxi.jpg
It could indeed be indicative of an arch mismatch. Have you tried passing "-march=armv8-a+crypto+crc" for example when building, instead of "-march=native"? |
@wtarreau nope, but I can try right now. |
Here is error message itself |
Can you please report the output of "gcc -v"? I guess the test on the version is indeed not correct. I did this part myself to try to build on older compilers. In the worst case you can simply disable this block and see if your perf is better with the build options. |
@wtarreau I think Android Studio uses clang for compilation, but still here is output for
|
Ah, got it! It's clang indeed, so it advertises v4.2.1. We'd need to check for the clang version then, as your version appears to include this builtin while the one I tested last time didn't. In the mean time it's harmless, feel free to comment out the whole block to be able to retest the performance on armv8 mode with all extensions enabled. |
@wtarreau so, with |
Oh that's really funny, probably some architectural optimizations are lost. From what I've found, the SD835 is an octa-A73 at up to 2.45 GHz, it must be awesome! I'm having a hard time imagining that it can only be a compiler issue. Please try "-mcpu=cortex-a73+crypto+crc" instead, and add -m64 to be certain it builds in 64-bit mode. Also double-check that you didn't lose the optimization level (typically -O3/-Ofast) when forcing the other flags. |
I've updated the cpuminer PR with the latest speedups and fixes. |
@wtarreau with |
So, I tried it, but with |
Ah, so at least it proves it's currently running in 32-bit mode, which explains the lower performance. If it's cpuminer you're using, you should build it with -DNOASM. Last time I updated the code there I even added a script called "build-linux-arm" or something like that, because the default options didn't work for me. |
LOL gpu dev on the brink of a burnout |
Guys, I've reviewed some details of the algo and also the performance ratio mentioned above. I'm thinking that the main issue is that the whole rambox should be used, to make sure there is no way to bypass it nor to precompute it. The problem is that memory controllers nowadays are extremely fast and prefetch tons of data on large systems (high-end CPUs, GPUs), but are dumb and unable to prefetch much on cheap hardware. However, the cheap hardware possesses architectural optimizations (crc32, aes) that large ones only partially possess (aes for x86) and that GPUs don't have. If we consider the instruction-equivalent latency, by carefully chaining all this it should be possible to make everyone run at a bounded speed.

Let's take my Neo4's 180 ns RAM access time as a reference. This device has fast CPU cores with all extensions enabled. If we aim at 200 ns as the time for a single operation, it means this machine must consume 20 more ns doing things that other machines will be slower at. My Skylake has a 65 ns access time. It does have AES but lacks CRC32; fine, let's find the fastest CRC32 implementation and repeat it as needed to fill the budget. A raspi 3B+ shows 160 ns, and it does have CRC32 but lacks AES, so we can put maybe one or two AES calls so that AES+RAM+CRC still gives 200 ns. Note that the PC will be able to prefetch multiple accesses at once, so multiple threads will benefit from this. But we don't need to read lots of data, the bandwidth is not interesting here.

Let's say we use a 64 MB work area (26 bits). This can be turned into 24 bits by considering 32-bit words. An AES operation returns 128 bits, which is sufficient to produce 5 memory addresses to fetch data from. You also want to always write, so that there is no efficient way to keep a shared read-only copy of this.
From what I'm reading in a few places, GPUs like the 1080Ti mentioned above seem to feature many memory channels (11, or 12 minus one to be more precise). The prefetch is a full 64-byte cache line like on many other CPUs, so a GPU could be 12 times faster than a PC or an SBC thanks to this. There's not much that can be done by adding extra instructions to the loop, as these could be amortized over the huge number of cores. However, the memory size can limit the number of active cores: a GPU with 12 GB of RAM could only have 192 active threads with a 64 MB work area each. But even with 192 cores at work, I expect they could perform well on AES or CRC. Still, in this case we're at 12 times the performance, not 100 times or so. Am I overlooking anything? |
@wtarreau nice research, what do you think about creating a separate issue to discuss Rainforest improvements? |
you're right, we're not that much in the speed area anymore :-) |
Hello! But let's open a new issue for this discussion. |
There was a report of incorrect speed on cpuminer with my PR there: tpruvot/cpuminer-multi#39. Is anybody aware of this? |
On Sat, May 25, 2019 at 02:37:38PM -0700, MikeMurdo wrote:
There was a report of incorrect speed on cpuminer with my PR there :
tpruvot/cpuminer-multi#39. Is anybody aware of this ?
Yes! I mentioned it when I brought the speedup patches. The problem was
that the reported hashes count is used to compute how often to print the
stats, and if we reported the correct value, the stats output was reported
256 times faster. I've now figured out how to fix it; I have a patch
somewhere, I'll send it eventually.
Willy
|
I finally uploaded my fixes in PR#39. |
Perfect, I'll take it and rebuild my patch. Thanks man |
Do you have the jareso-experimental miner? |
Hello again @bschn2, I have concerns about v2 speed. Don't you think that rainforest became way too slow? My MacBook's Intel i7 produces ~30 H/s and it's not enough to mine even one block at the lowest diff on the test network 😅