More granular latency buffer sizes #12

Open · AndreiLux opened this issue Nov 5, 2016 · 1 comment
AndreiLux commented Nov 5, 2016

I've been trying to modify the latency benchmark to include more granular buffer access sizes in order to get a smoother latency curve, but it seems I don't correctly understand how the algorithm works.

In particular, I modified the main control loop to increase the testsize by 1/2 of the previous full 2^n increment:

....
    else
        printf("\n");

    nbits = 10;

    for (niter = 0; (1 << nbits) <= size; niter++)
    {
        int testsize;

        /* alternate between the full 2^n size and 1.5x the previous 2^n */
        if (niter % 2 == 0)
            testsize = (1 << nbits++);
        else
            testsize = (1 << (nbits - 1)) + (1 << (nbits - 1)) / 2;

        xs1 = xs2 = ys = ys1 = ys2 = 0;

....

            t_before = gettime();
            random_read_test(buffer + testoffs, count, testsize);
            t_after = gettime();

....

static void __attribute__((noinline)) random_read_test(char *zerobuffer,
                                                       int count, int testsize)
{
    uint32_t seed = 0;
    uintptr_t addrmask = testsize - 1; /* mask that keeps offsets within testsize */

This gives me the intermediate test sizes that I wanted:

block size : single random read / dual random read
L1 :   10.9 ns          /    15.9 ns
      1024 :    0.0 ns          /     0.0 ns
      1536 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      3072 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      6144 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     12288 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     24576 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     49152 :    0.0 ns          /     0.0 ns
     65536 :    4.1 ns          /     6.1 ns
     98304 :    4.0 ns          /     6.1 ns
    131072 :    6.1 ns          /     8.0 ns
    196608 :    6.1 ns          /     8.0 ns
    262144 :   10.7 ns          /    13.6 ns
    393216 :   10.7 ns          /    13.6 ns
    524288 :   13.2 ns          /    16.1 ns
    786432 :   13.2 ns          /    16.1 ns
   1048576 :   22.4 ns          /    22.5 ns
   1572864 :   22.2 ns          /    24.8 ns
   2097152 :   93.2 ns          /   116.1 ns
   3145728 :   93.1 ns          /   115.4 ns
   4194304 :  123.7 ns          /   147.0 ns
   6291456 :  121.9 ns          /   145.3 ns
....

But as you can notice in the figures, the latencies don't actually change from the previous full 2^n size.

Looking at the code in random_read_test, I see that you limit the access pattern to a given memory range by simply masking the randomized index with a defined address mask. I of course changed the parameters as above so that the proper testsize is passed in instead of just nbits.
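To illustrate, this is roughly how I understand the index generation to work; a simplified standalone sketch with my own names and glibc-style LCG constants, not the actual benchmark code:

#include <stdint.h>
#include <stdio.h>

/* Simplified sketch of the indexing scheme as I understand it: a 32-bit
 * LCG produces a pseudo-random value and the offset is obtained by
 * masking it down to the buffer range. */
static uint32_t lcg_next(uint32_t *seed)
{
    *seed = *seed * 1103515245u + 12345u; /* glibc-style constants, assumed */
    return *seed;
}

static uintptr_t next_offset(uint32_t *seed, uintptr_t addrmask)
{
    /* For a power-of-two testsize, addrmask = testsize - 1 has all the
     * low bits set, so every offset in [0, testsize) is reachable. */
    return (uintptr_t)lcg_next(seed) & addrmask;
}

int main(void)
{
    uint32_t seed = 0;
    for (int i = 0; i < 4; i++)
        printf("%lu\n", (unsigned long)next_offset(&seed, 1024 - 1));
    return 0;
}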

The resulting behaviour should theoretically work, but obviously I'm missing something, as it doesn't. As far as I can see this shouldn't be an issue with the LCG (I hope). Do you have any input on my modifications, or any feedback on other methods of changing your random_read_test to accept test sizes other than 2^n?

ssvb (Owner) commented Jan 29, 2017

Well, I could have a look if you provided a compilable test branch with this code. Using arbitrary sizes may require rescaling the offset via multiplication rather than masking out the higher bits.
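A minimal sketch of what such rescaling could look like, assuming the random generator yields uniform 32-bit values (untested, names are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Map a uniform 32-bit random value into [0, testsize) with a widening
 * multiply instead of masking, so testsize no longer needs to be a
 * power of two. */
static uintptr_t scaled_offset(uint32_t rnd, uint32_t testsize)
{
    return (uintptr_t)(((uint64_t)rnd * testsize) >> 32);
}

int main(void)
{
    /* with testsize = 1536 the offsets span the whole [0, 1536) range */
    printf("%lu\n", (unsigned long)scaled_offset(0xFFFFFFFFu, 1536)); /* 1535 */
    printf("%lu\n", (unsigned long)scaled_offset(0x80000000u, 1536)); /* 768 */
    return 0;
}

The widening 64-bit multiply in the inner loop is extra work, which leads to the concern below.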

There is just one potential problem with making the implementation more complex. We want to ensure that all the temporary variables from the inner loop are always allocated in CPU registers. If there are any spills to the stack, then we need to implement this code in assembly (just like it is done for 32-bit ARM).
