DMA too slow for Fastino #1423

Closed
pathfinder49 opened this issue Jan 29, 2020 · 18 comments

Comments

@pathfinder49
Contributor

Bug Report

One-Line Summary

Updating multiple Fastino channels at 2.55 MS/s via DMA results in an RTIO underflow.

Issue Details

Using the Fastino single-channel update functionality, the maximum sample rate cannot be achieved on all channels. Without DMA, Kasli can only update a single channel at ~1.3 MS/s. Using DMA, I found the following:

| Number of channels | Samples per channel before underflow (rounded up) |
|---|---|
| 32 | 8 |
| 16 | 17 |
| 8 | 41 |
| 4 | ~200 |
| 2 | ~600 (stochastic) |
| 1 | >40_000 |
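
One rough way to read these numbers is to convert them into the total number of events played back before the underflow. The sketch below assumes that each `set_dac_mu` call in the recorded sequence corresponds to one RTIO event, and it takes the approximate ("~") entries at their nominal value; the single-channel row is only a lower bound.

```python
# (channels, samples per channel before underflow) as reported above.
# "~" entries are taken at face value; the 1-channel row is a lower bound.
measurements = [(32, 8), (16, 17), (8, 41), (4, 200), (2, 600), (1, 40_000)]


def events_before_underflow(n_channels, samples_per_channel):
    """Total events, assuming one RTIO event per channel per sample."""
    return n_channels * samples_per_channel


totals = {n: events_before_underflow(n, s) for n, s in measurements}

for n, total in totals.items():
    print(f"{n:2d} channels: about {total} events before underflow")
```

The total rises as the channel count (and therefore the requested event rate) falls.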

Steps to Reproduce

The experiment below demonstrates the bug. The underflow time was determined by measuring the Fastino output and/or by finding the sequence length at which underflows stopped occurring.

from artiq.experiment import *
import numpy as np

class FastinoTest(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("core_dma")
        self.setattr_device("fastino_0")
        print("build")

    def run(self):
        n = 1 << 2
        self.s = [self.fastino_0.voltage_to_mu(9.9*np.cos(2*np.pi*i/n))
            for i in range(n)]
        self.do()

    @kernel
    def do(self):
        self.core.reset()
        f = self.fastino_0
        self.record(f)
        sinusoid_handle = self.core_dma.get_handle("sinusoid")
        self.core.break_realtime()
        f.init()
        f.update(0xffffffff)
        delay(1*us)

        for i in range(10):
            f.set_leds(0xaa)
            delay(.1*s)
            f.set_leds(0x55)
            delay(.1*s)
        print("start DMA")
        self.core.break_realtime()
        self.core_dma.playback_handle(sinusoid_handle)

        self.core.wait_until_mu(now_mu())

    @kernel
    def record(self, f):
        with self.core_dma.record("sinusoid"):
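            # 14*7//2*8*2 = 784; halving gives 392 mu per sample period
            # (about 2.55 MS/s if 1 mu = 1 ns, which is assumed here).
            k0 = 14*7//2*8*2
            k0 = k0 // 2
            n_ch = 32
            # Spacing between consecutive channel writes within one sample.
            k0 = k0 // n_ch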

            for i in range(1000):
                for j in range(len(self.s)):
                    for ch in range(n_ch):
                        f.set_dac_mu(ch, self.s[j])
                        delay_mu(k0)
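
The timing arithmetic in `record()` can be checked on the host. The sketch below reproduces the `k0` calculation for each channel count in the table; the helper names are hypothetical, and the conversion to MS/s assumes 1 machine unit = 1 ns, which the issue does not state. Because of the integer division, the sample period actually recorded for 32 and 16 channels comes out slightly shorter than 392 mu.

```python
# Sample period in machine units, computed exactly as in record().
FRAME_MU = 14 * 7 // 2 * 8 * 2 // 2


def channel_spacing(n_ch):
    """delay_mu() between consecutive channel writes (the final k0)."""
    return FRAME_MU // n_ch


def realised_period(n_ch):
    """Period of one full pass over all channels, after integer truncation."""
    return channel_spacing(n_ch) * n_ch


for n_ch in (32, 16, 8, 4, 2, 1):
    period = realised_period(n_ch)
    # Assumption: 1 mu = 1 ns, so 1000 / period gives MS/s.
    print(f"n_ch={n_ch:2d}  k0={channel_spacing(n_ch):3d} mu  "
          f"period={period} mu  rate={1000 / period:.3f} MS/s")
```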

Your System (omit irrelevant parts)

  • Hardware involved: Kasli, Fastino
@hartytp
Collaborator

hartytp commented Jan 29, 2020

I believe this is a symptom of the more general issue #946

@hartytp
Collaborator

hartytp commented Jan 29, 2020

@jordens we discussed getting data on RTIO/DMA throughput for Kasli. Were there any measurements in particular you wanted?

@sbourdeauducq
Member

I am pretty sure that the optimization I discussed in #946 will yield good results, and getting measurements should not be in the critical path - though it is good to make them at some point to quantify the improvement.

@hartytp
Collaborator

hartytp commented Jan 29, 2020

@pathfinder49 you can also try with the core analyzer disabled in the gateware. That should speed things up a decent amount (see #946)

@jordens
Member

jordens commented Jan 29, 2020

@hartytp You should make the measurements that tell you whether your use case will be limited by RTIO fabric/DMA throughput. You know best what those are.

@dhslichter
Contributor

I would describe this as just a design flaw in Fastino, if I understand correctly. If you want Kasli to stream 32 channels of 16-bit data at 2.5 MS/s, this is 160 MB/s sustained data transfer for EACH Fastino card, using a bus that is shared for many other ARTIQ purposes, and with latency constraints that are much more stringent than for typical computing applications.

It seems to me -- and please correct me if I am wrong -- that modifications to DMA may provide a bit of a patch, or adding much larger buffers to reduce the impact of all the other bus traffic on the ability to guarantee samples at the DAC on time, but that really what one should consider is having a dedicated SDRAM on Fastino for waveform playback, using a dedicated bus that is optimized for the task, and properly sized buffering queues on the Fastino FPGA to allow for memory refresh. Then have the DMA recording process place samples into the Fastino memory rather than Kasli memory.

@jordens
Member

jordens commented Jan 29, 2020

You are wrong. The limitation is definitely not Fastino. Fastino has no problem handling all the samples you throw at it. Hardware is not the limitation: the SDRAM on Kasli can already sustain an order of magnitude more than 160 MB/s.
And already with just the interpolator (that's coming up) or any other way of generating the samples at the PHY (wide RTIO for Fastino is yet another approach in the pipeline) there is also no problem at all to feed it from Kasli.
You may notice that Shuttler (which you appear to be peddling by postulating design flaws in Fastino) will have the exact same issue. Whether you optimize DMA on Kasli or DMA on Shuttler doesn't matter much for this problem. On the contrary: improving RTIO DMA is better overall value for money than building a high performance Shuttler DMA solution.
It was a deliberate choice to not put memory and DMA on Fastino and keep it a slim multichannel DAC with a very low level data link. This design is a unique success story that enabled building it in record time. DMA and SDRAM would have been a very complicated, risky, time-consuming, and costly thing to do. Even pairing a dedicated Kasli satellite with a couple Fastinos to feed them from local memory looks more pragmatic and certain than alternative options.

@sbourdeauducq
Member

sbourdeauducq commented Jan 30, 2020

> modifications to DMA may provide a bit of a patch

That's not just "a bit of patch", I expect the improvement to be very significant. For linear access and a fine-tuned command pattern, one can use ~99% of the peak I/O bandwidth of an SDRAM, i.e. the Kasli SDRAM could do ~15.8 Gbps on Kasli 2.0 with the -3 speed grade FPGA, and more if we do not phase-lock the SDRAM clock to the CPU clock (not doing it increases latency and complexity). Not counting RTIO-DMA overhead and analyzer writeback (maybe the analyzer could ignore Fastino data channels?), a Fastino at maximum speed is just 1.3 Gbps.
The issue is that nobody has funded this DMA optimization on Kasli yet. DMA was developed for KC705, where the wide memory bus makes the optimization unnecessary, and without any particularly bandwidth-demanding application on the horizon.
Besides, Fastino can also be potentially used with gateware that produces samples on-the-fly in the FPGA (like SU-Servo and SAWG do), and then memory bandwidth isn't part of the equation.
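
The bandwidth figures quoted in this thread can be reproduced with a few lines of arithmetic. This is only a sketch of the raw sample payload: it ignores RTIO-DMA overhead and analyzer writeback, as noted above, and the 15.8 Gbps SDRAM figure is simply the estimate quoted in this comment.

```python
CHANNELS = 32
BITS_PER_SAMPLE = 16
SDRAM_PEAK_BPS = 15.8e9  # estimate quoted in this thread for Kasli 2.0


def payload_rate_bps(sample_rate_hz, channels=CHANNELS, bits=BITS_PER_SAMPLE):
    """Raw sample payload in bits per second, excluding any DMA overhead."""
    return channels * bits * sample_rate_hz


rate_2p5 = payload_rate_bps(2.5e6)    # rate behind the 160 MB/s figure
rate_2p55 = payload_rate_bps(2.55e6)  # Fastino maximum update rate
fraction = rate_2p55 / SDRAM_PEAK_BPS

print(f"2.50 MS/s: {rate_2p5 / 8 / 1e6:.0f} MB/s")
print(f"2.55 MS/s: {rate_2p55 / 1e9:.2f} Gbps")
print(f"Share of quoted SDRAM peak: {fraction:.1%}")
```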

@dtcallcock

Out of interest, how do you expect DMA to perform on Kasli-SOC?

Will #946 also be applicable to Kasli-SOC?

@sbourdeauducq
Member

sbourdeauducq commented Jan 30, 2020

> Out of interest, how do you expect DMA to perform on Kasli-SOC?

Like this: https://www.embeddedrelated.com/showarticle/988.php

> Will #946 also be applicable to Kasli-SOC?

The #946 optimizations are not applicable to Zynq, and would result in the same or higher performance than Zynq can possibly achieve (at least without adding a dedicated SDRAM chip for fabric DMA, which sounds complicated and expensive). I suspect higher. FPGA fabric is efficient at moving parallel/pipelined data around, and allows the fine-tuning of low-level SDRAM command sequences - there is no clear advantage to the hard SDRAM controller used in Zynq and plenty of disadvantages (it is sometimes faster to design something from scratch in the fabric than get the quirky/buggy Zynq hardware to behave).

@pathfinder49
Contributor Author

> @pathfinder49 you can also try with the core analyzer disabled in the gateware. That should speed things up a decent amount (see #946)

I can't seem to find how to disable the core analyser. Could someone please give me a hint?

@marmeladapk
Contributor

#946 (comment)

@dhslichter
Contributor

> You are wrong.

OK, I stand corrected.

> You may notice that Shuttler (which you appear to be peddling by postulating design flaws in Fastino) will have the exact same issue.

I'm not trying to "peddle" Shuttler. It's aimed at a different use case than Fastino, and I am not trying to suggest that people should choose Shuttler instead. I agree that the problem of feeding the DACs their samples will be even worse on Shuttler than Fastino. And maybe the answer for Shuttler is to stream reduced-representation samples out from a Kasli over EEM, and have the FPGA on Shuttler just be in charge of turning that into DAC samples in whatever the appropriate manner is (CIC, spline, etc). This would certainly allow Shuttler to reuse/build on developments made for Fastino, which is good.

> Whether you optimize DMA on Kasli or DMA on Shuttler doesn't matter much for this problem. On the contrary: improving RTIO DMA is better overall value for money than building a high performance Shuttler DMA solution.

I agree that improving general RTIO DMA is much more useful than doing some specific optimized use case. I didn't understand that the proposal from @sbourdeauducq in #946 (comment) would have such a dramatic impact as to completely alleviate all concerns for running multiple Fastinos off a single Kasli. This changes my thoughts on whether Shuttler would need its own DMA.

> It was a deliberate choice to not put memory and DMA on Fastino and keep it a slim multichannel DAC with a very low level data link. This design is a unique success story that enabled building it in record time.

I would contend that while the hardware layout/debugging may be a "unique success story", this label seems inconsistent with the issue that started this post, namely the inability to run even a single one of the 32 output channels at the spec'ed max update rate. But if making significant improvements to RTIO DMA fixes this completely, then sure, it has the potential to be a "unique success story". Likewise, once a PHY exists to generate the samples from reduced representations rather than needing to stream them from memory, the issue would be solved. According to @hartytp there is no funding for the required improvements to RTIO DMA -- again, please correct me if I am wrong here. So it remains to be seen when Fastino can actually realize its potential, right? The PHY for sample generation has been funded by Hannover, is that right?

> DMA and SDRAM would have been a very complicated, risky, time-consuming, and costly thing to do. Even pairing a dedicated Kasli satellite with a couple Fastinos to feed them from local memory looks more pragmatic and certain than alternative options.

Yes I agree, and this may be the way to do Shuttler, for example.

@jordens
Member

jordens commented Jan 30, 2020

> inability to run even a single one of the 32 output channels at the spec'ed max update rate

I have been running 32 channels with full update rate just fine. This has been shown. The noise measurements have been done using this mode. There is not a single aspect of the Fastino project that would prevent this. Fastino can fully realize its potential. Therefore it's completely accurate to label it a unique success story.

Maybe your confusion is just a lack of knowledge of how Fastino works: The only mode of operation currently is to constantly stream all 32 channels at 2.55 MS/s from Kasli. Fastino either works at full rate or it doesn't work at all.

The fact that RTIO and DMA would generally limit high event rates was clear months ago (in fact years since #946). It was clearly communicated and explicitly accepted that feeding a (any) RTIO PHY with arbitrary waveforms on many channels would likely not work. But this is in no way implied or triggered by or attributable to Fastino.

You can just use the labels we place on these issues for this purpose. There is no secret or back room dealing going on.
The current PHY with wide and narrow interfaces is funded by Oxford. The configurable CIC interpolator is funded by LUH.

@dhslichter
Contributor

> I have been running 32 channels with full update rate just fine. This has been shown. The noise measurements have been done using this mode. There is not a single aspect of the Fastino project that would prevent this. Fastino can fully realize its potential. Therefore it's completely accurate to label it a unique success story.

OK. Then what's the difference with what @pathfinder49 is doing? And yes, I have a lack of knowledge about Fastino, I have not been following the development due to bandwidth limitations. My apologies for coming in late and not having all the background story.

> The fact that RTIO and DMA would generally limit high event rates was clear months ago (in fact years since #946).

Sure, not debating that, and I was not trying to say this was "caused" by Fastino. But it seemed from this initial post like it was not possible to run Fastino by streaming DMA samples, and I had the impression from @hartytp in an email yesterday that the funding was not in place to address this issue.

@jordens
Member

jordens commented Jan 30, 2020

> Then what's the difference with what @pathfinder49 is doing?

He is trying to generate arbitrary samples using RTIO and/or DMA on Kasli.
I generate (or repeat) them in gateware at/before the PHY.

> Sure, not debating that, and I was not trying to say this was "caused" by Fastino.

If it's not attributable to Fastino, and if it furthermore can and should be resolved elsewhere, then it's certainly wrong to claim that there is a design flaw in Fastino.

> But it seemed from this initial post like it was not possible to run Fastino by streaming DMA samples, and I had the impression from @hartytp in an email yesterday that the funding was not in place to address this issue.

That's correct for this literal use case. Maybe the wide RTIO interface addresses it, maybe it won't. The interpolator will certainly address it for its use cases.

@hartytp
Collaborator

hartytp commented Jan 30, 2020

> The limitation is definitely not Fastino. Fastino has no problem handling all the samples you throw at it. Hardware is not the limitation: the SDRAM on Kasli can already sustain an order of magnitude more than 160 MB/s.

> Whether you optimize DMA on Kasli or DMA on Shuttler doesn't matter much for this problem. On the contrary: improving RTIO DMA is better overall value for money than building a high performance Shuttler DMA solution.

Indeed.

Not to mention that #946 was an issue long before Fastino/Shuttler/anything else. DMA throughput is already a significant bottleneck. You just start to notice it even more when you try to do things like fast shuttling.

Anyway, fun as it's been, I think this is a duplicate of #946 so closing.

hartytp closed this as completed Jan 30, 2020
@hartytp
Collaborator

hartytp commented Jan 30, 2020

> The interpolator will certainly address it for its use cases.

Indeed. But issues are still likely if one wants to update, say, 10 or more DAC channels more or less simultaneously for any length of time at anything close to the max rate the hardware can support. As you say though, that's not new or surprising.


7 participants