Kasli DMA sustained event rate #946

cjbe · 2018-03-05T15:14:29Z

The sustained DMA event rate is surprisingly low on Kasli. Using the experiment below, I find that the shortest pulse period (pulse plus delay) without underflow for a TTL output is:

Opticlock: 530mu
DRTIO local TTL: 1150mu
DRTIO remote TTL: 490mu

For comparison, with the current KC705 gateware this is 128mu, and sb0 believes this should be closer to 48mu (3 clock cycles per event, https://irclog.whitequark.org/m-labs/2018-03-05)

(N.B. the RTIO clock for the DRTIO gateware is 150 MHz, vs 125 MHz for Opticlock)
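
For readers comparing these numbers, here is a minimal sketch of the unit arithmetic. It assumes 8 machine units per coarse RTIO clock cycle, and it reads the 48mu figure as 3 cycles per event times 2 events (rising and falling edge) per pulse; both are assumptions, and the helper names are made up for illustration.

```python
FINE_STEPS = 8  # assumption: machine units per coarse RTIO clock cycle


def mu_to_ns(t_mu, rtio_clk_hz):
    """Convert machine units to nanoseconds for a given RTIO clock."""
    return t_mu * 1e9 / (rtio_clk_hz * FINE_STEPS)


def expected_period_mu(cycles_per_event, events_per_pulse=2):
    """Pulse period in mu if each event costs a fixed number of cycles."""
    return cycles_per_event * events_per_pulse * FINE_STEPS


print("3 cycles/event ->", expected_period_mu(3), "mu per pulse")
for name, t_mu, clk in [
    ("Opticlock", 530, 125e6),
    ("DRTIO local TTL", 1150, 150e6),
    ("DRTIO remote TTL", 490, 150e6),
]:
    print(f"{name}: {t_mu} mu = {mu_to_ns(t_mu, clk):.0f} ns per pulse")
```

Because of the different RTIO clocks, the same count of machine units is a shorter wall-clock time on the DRTIO gateware than on Opticlock.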

Experiment:

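from artiq.experiment import *
import numpy as np


class DMASaturate(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("core_dma")
        self.setattr_device("ttlo0")

        # Pulse period in machine units; reduce it until playback underflows.
        t_mu = self.get_argument("period", NumberValue(128))
        self.t_mu = np.int64(t_mu)

    @kernel
    def run(self):
        tp_mu = 8  # pulse width in machine units

        self.core.reset()

        # Record 10000 pulses (two RTIO events each) into a DMA trace.
        with self.core_dma.record("ttl_local"):
            for _ in range(10000):
                self.ttlo0.pulse_mu(tp_mu)
                delay_mu(self.t_mu-tp_mu)

        h = self.core_dma.get_handle("ttl_local")

        # Fetch the handle once, then replay the trace 10 times back to back.
        self.core.break_realtime()
        for i in range(10):
            self.core_dma.playback_handle(h)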
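
The search for the shortest period can be automated by bisection. The sketch below is not part of the original experiment: `underflows` is a hypothetical callable that would run the experiment with a given period and report whether an underflow occurred. It is replaced here by a stand-in so the example runs without hardware. The search assumes underflow behaviour is deterministic and monotonic in the period.

```python
def shortest_period(underflows, lo, hi):
    """Return the smallest period in (lo, hi] that does not underflow.

    Assumes underflows(lo) is True, underflows(hi) is False, and that
    underflows() is monotonic in the period.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if underflows(mid):
            lo = mid  # still too fast, search longer periods
        else:
            hi = mid  # works, try to shorten further
    return hi


def fake_underflows(period_mu, threshold=530):
    """Stand-in for a hardware run: underflow below a fixed threshold."""
    return period_mu < threshold


print(shortest_period(fake_underflows, 8, 4096))
```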

whitequark · 2018-03-05T15:17:48Z

Remote TTL is faster than local TTL?

cjbe · 2018-03-05T15:22:00Z

Yes, remote is faster than local. I was surprised by this too, but I verified that, when there is no underflow, both the master and the slave output the correct sequence (number of pulses on a counter).

jordens · 2018-03-07T12:54:00Z

When RAM-bound, the throughput ratio of 4:1 (KC705:Kasli) is expected from the data width.
The overall slowdown by a factor of 5 is unexpected.
The factor of ~2 when doing DRTIO on local TTL is also unexpected.

sbourdeauducq · 2018-03-09T04:24:10Z

That's due to the analyzer interfering: it writes the full DMA sequence back to memory, using IO bandwidth and causing bus arbitration delays, DRAM page cycles, etc. With the analyzer disabled I get 207mu instead of ~1150mu.
No need to modify gateware, disabling it in the firmware is sufficient:

--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -223,8 +223,8 @@ fn startup_ethernet() {
     io.spawn(16384, session::thread);
     #[cfg(any(has_rtio_moninj, has_drtio))]
     io.spawn(4096, moninj::thread);
-    #[cfg(has_rtio_analyzer)]
-    io.spawn(4096, analyzer::thread);
+    //#[cfg(has_rtio_analyzer)]
+    //io.spawn(4096, analyzer::thread);
 
     let mut net_stats = ethmac::EthernetStatistics::new();
     loop {

sbourdeauducq · 2018-03-09T04:28:33Z

The KC705 is less affected because the wider DRAM words make linear transfers (which is what the DMA core and the analyzer are doing) more efficient. We could reach similar efficiency on Kasli by implementing optional long bursts in the DRAM controller, and supporting them in the DMA and analyzer cores.

cjbe · 2018-03-09T08:22:51Z

@sbourdeauducq I don't see how this should make local and remote TTL transactions take different times - could you reproduce this aspect?

cjbe · 2018-03-09T09:02:11Z

Right - if I am reading the SDRAM core correctly, it is currently not buffering reads and writes, or optimising access patterns. So on Kasli during a DMA sequence, in the worst case of DMA and analyser data in the same bank:

open row (2 cycles)
we do a read burst or two of 2*8=16 bytes, each of which has a latency of ~6 cycles (through phy etc)
then by a round robin arbiter, we accept the analyser write:
close row (2 cycles)
open different row (2 cycles)
we write one burst of 16 bytes, again ~6 cycles.
close row (2 cycles)
The sequence DMA entry for a TTL is 18 bytes, so we only do on average slightly over 1 read.
This is then ~22 cycles = 168ns
Plus a few cycles to write the RTIO event into the CRI, totals ~185ns

So this broadly tallies with the opticlock 530ns/2 = 265ns per event = 33 cycles, but does not explain the ~1.1 us per event.

Whereas reading/writing a whole row would take 2+6+125+2=135 cycles for 2KB = 111x 18-byte RTIO events, or just over 1 cycle per event.
Hence without the RTIO analyser ~5 cycles per RTIO event taking into account the CRI write = 40ns
Or just a cycle or two extra for the RTIO analyser writeback, assuming it is cached similarly.

So, depending on the effort required, it seems well worth implementing long bursts.
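
The cycle budget above can be redone as a short calculation. This is a sketch under stated assumptions: an 8 ns cycle (implied by "265ns per event = 33 cycles"), a 2000-byte row (which matches the 125 data cycles quoted), and 18/16 read bursts per event on average. The constant names are chosen for illustration only.

```python
CYCLE_NS = 8        # assumption: 125 MHz clock
BURST_BYTES = 16    # one 2*8-byte burst
EVENT_BYTES = 18    # DMA entry for one TTL event
BURST_LATENCY = 6   # cycles per burst, through the PHY etc.
ROW_OPEN = 2
ROW_CLOSE = 2

# Worst case: each DMA read is followed by an analyser write to another row.
reads_per_event = EVENT_BYTES / BURST_BYTES
interleaved_cycles = (
    ROW_OPEN + reads_per_event * BURST_LATENCY + ROW_CLOSE  # DMA read
    + ROW_OPEN + BURST_LATENCY + ROW_CLOSE                  # analyser write
)
print(f"interleaved: {interleaved_cycles:.2f} cycles "
      f"= {interleaved_cycles * CYCLE_NS:.0f} ns per event")

# Long burst: open the row once and stream all of it.
ROW_BYTES = 2000    # assumption: matches the 125 data cycles in the estimate
data_cycles = ROW_BYTES // BURST_BYTES
row_cycles = ROW_OPEN + BURST_LATENCY + data_cycles + ROW_CLOSE
events_per_row = ROW_BYTES // EVENT_BYTES
print(f"whole row: {row_cycles} cycles for {events_per_row} events "
      f"= {row_cycles / events_per_row:.2f} cycles per event")
```

Under these assumptions the interleaved case comes out at about 21 cycles per event, close to the ~22 cycles estimated above, while the long-burst case amortises the row open/close cost over more than a hundred events.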

sbourdeauducq · 2018-03-09T09:35:02Z

Here are the results I got:

With the original design, I get t_mu=650 for local, and t_mu=510 for remote. I see a difference but it is not as marked as in your case. Perhaps the difference is due to the CPUs making different DRAM accesses.
With the analyzer disabled, they both take about the same time: t_mu=~210
With the analyzer enabled and the patch below, also about the same time: t_mu=~670

diff --git a/artiq/gateware/rtio/dma.py b/artiq/gateware/rtio/dma.py
index 735d52f54..fd37c2ed1 100644
--- a/artiq/gateware/rtio/dma.py
+++ b/artiq/gateware/rtio/dma.py
@@ -331,13 +331,15 @@ class DMA(Module):
 
         flow_enable = Signal()
         self.submodules.dma = DMAReader(membus, flow_enable)
+        self.submodules.fifo = stream.SyncFIFO(self.dma.source.description.payload_layout, 16, True)
         self.submodules.slicer = RecordSlicer(len(membus.dat_w))
         self.submodules.time_offset = TimeOffset()
         self.submodules.cri_master = CRIMaster()
         self.cri = self.cri_master.cri
 
         self.comb += [
-            self.dma.source.connect(self.slicer.sink),
+            self.dma.source.connect(self.fifo.sink),
+            self.fifo.source.connect(self.slicer.sink),
             self.slicer.source.connect(self.time_offset.sink),
             self.time_offset.source.connect(self.cri_master.sink)
         ]

Here is what I propose:

The DRAM controller is extended to support long bursts of sequential write or read data. Those bursts can be long - maybe something like 16 or even 32 cycles.
In this proposed design, the long burst length must be less than the size of the DRAM page buffer. With DDR3, those buffers are typically 1 or 2 kilobytes long, so that's plenty.
Since long bursts will for a short time achieve 100% data I/O bus utilization (back-to-back transfers), all PHYs used in ARTIQ (Kintex-7, Artix-7, Kintex Ultrascale) should be reviewed to ensure that they operate correctly under this condition.
Non-burst transfer (e.g. CPU) performance is not affected.
DRAMs can tolerate significant refresh jitter and I don't expect much of a problem with refresh; we can just ignore the long bursts as long as the average refresh period is respected.
Bursts can be signaled using the specified way in the Wishbone standard.
This way, all the DRAM timing non-determinism and various overheads are pushed into the initiation of the burst, which is a comparatively small part of the transaction time. Furthermore, bursts can achieve very high data I/O bandwidth utilization (>90%).
The Migen synchronous FIFO core is extended to support low and high watermarks.
Such a FIFO is inserted into the analyzer core's DRAM write path. A long burst write begins when the high watermark is reached, which means that the FIFO contains at least one long burst worth of data.
Flushing the FIFO upon analyzer shutdown is slightly tricky; it can be done by adding dummy records until the high watermark is reached exactly.
Another FIFO is inserted into the DMA core's DRAM read path. A long burst read begins when the low watermark is reached, which means that the FIFO has space for one long burst read worth of data.
The DMA core may overrun the sequence boundary, but this is of no practical consequence.
DMA and analyzer buffers must be aligned to long burst boundaries.
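
The watermark behaviour proposed above can be illustrated with a small software model. This is a behavioural sketch only, not Migen gateware and not the implementation: the class `WatermarkFIFO`, the function `analyzer_write`, and the choice of `None` as the dummy record are all hypothetical.

```python
from collections import deque


class WatermarkFIFO:
    """Software model of a FIFO with low and high watermarks."""

    def __init__(self, depth, burst_len):
        if burst_len > depth:
            raise ValueError("burst must fit in the FIFO")
        self.depth = depth
        self.burst_len = burst_len
        self.items = deque()

    @property
    def level(self):
        return len(self.items)

    @property
    def high_watermark(self):
        # Write side: at least one long burst of data is buffered.
        return self.level >= self.burst_len

    @property
    def low_watermark(self):
        # Read side: there is room for one long burst of incoming data.
        return self.depth - self.level >= self.burst_len

    def push(self, item):
        if self.level >= self.depth:
            raise OverflowError("FIFO full")
        self.items.append(item)

    def pop(self):
        return self.items.popleft()


def analyzer_write(fifo, records, dummy=None):
    """Emit records as fixed-length bursts, padding the last one on flush."""
    bursts = []
    for record in records:
        fifo.push(record)
        if fifo.high_watermark:
            bursts.append([fifo.pop() for _ in range(fifo.burst_len)])
    if fifo.level:
        # Shutdown: add dummy records until the high watermark is reached.
        while not fifo.high_watermark:
            fifo.push(dummy)
        bursts.append([fifo.pop() for _ in range(fifo.burst_len)])
    return bursts


fifo = WatermarkFIFO(depth=32, burst_len=16)
bursts = analyzer_write(fifo, list(range(40)))
print(len(bursts), "bursts, last one:", bursts[-1])
```

In this model every DRAM transaction is a whole burst, which is why the buffers need to be aligned to burst boundaries and why a DMA read may run past the end of the sequence.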

hartytp · 2020-09-07T11:22:23Z

@pca006132 how is the DMA performance on Zynq? Does the ARM RAM controller give better performance?

pca006132 · 2020-09-08T15:25:47Z

> @pca006132 how is the DMA performance on Zynq? Does the ARM RAM controller give better performance?

There is some debug code and cache flushing in the current artiq-zynq master. With those removed (and the cache flush moved to another location), we can get to 65mu.

Note that this is because the handle is reused every time. Cache flushing is a pretty expensive operation, so the time it would take to get the handle is not negligible.

~~Note: This is not using ACP as it is not finished yet, I expect a bit better performance with ACP.~~
Edit: ACP would not be used for DMA due to low bandwidth.

hartytp · 2020-09-08T15:41:45Z

cool! That's a big step forwards. Is that with the analyzer enabled? I remember there being quite a long tail to the underflow distribution where we'd very occasionally find that sequences which would normally run with quite a bit of slack would underflow. If that's also reduced it would be wonderful...

pca006132 · 2020-09-08T15:46:51Z

> cool! That's a big step forwards. Is that with the analyzer enabled? I remember there being quite a long tail to the underflow distribution where we'd very occasionally find that sequences which would normally run with quite a bit of slack would underflow. If that's also reduced it would be wonderful...

Yes, the analyzer is enabled. I could get some analyzer output:

OutputMessage(channel=4, timestamp=17094553753, rtio_counter=17094549496, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553761, rtio_counter=17094549528, address=0, data=0)
OutputMessage(channel=4, timestamp=17094553818, rtio_counter=17094549560, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553826, rtio_counter=17094549592, address=0, data=0)
OutputMessage(channel=4, timestamp=17094553883, rtio_counter=17094549624, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553891, rtio_counter=17094549656, address=0, data=0)
OutputMessage(channel=4, timestamp=17094553948, rtio_counter=17094549688, address=0, data=1)
OutputMessage(channel=4, timestamp=17094553956, rtio_counter=17094549720, address=0, data=0)
OutputMessage(channel=4, timestamp=17094554013, rtio_counter=17094549752, address=0, data=1)
OutputMessage(channel=4, timestamp=17094554021, rtio_counter=17094549784, address=0, data=0)
OutputMessage(channel=4, timestamp=17094554078, rtio_counter=17094549816, address=0, data=1)
OutputMessage(channel=4, timestamp=17094554086, rtio_counter=17094549848, address=0, data=0)
OutputMessage(channel=4, timestamp=17094554143, rtio_counter=17094549880, address=0, data=1)
OutputMessage(channel=4, timestamp=17094554151, rtio_counter=17094549912, address=0, data=0)

So it should be working correctly I think.
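
As a cross-check, the timestamps in the analyzer output above can be differenced to recover the pulse period and width. The sketch below copies the values verbatim. Reading `rtio_counter` as the counter value at the time each event was submitted is an assumption.

```python
# (timestamp, rtio_counter, data) copied from the analyzer output above
messages = [
    (17094553753, 17094549496, 1),
    (17094553761, 17094549528, 0),
    (17094553818, 17094549560, 1),
    (17094553826, 17094549592, 0),
    (17094553883, 17094549624, 1),
    (17094553891, 17094549656, 0),
    (17094553948, 17094549688, 1),
    (17094553956, 17094549720, 0),
    (17094554013, 17094549752, 1),
    (17094554021, 17094549784, 0),
    (17094554078, 17094549816, 1),
    (17094554086, 17094549848, 0),
    (17094554143, 17094549880, 1),
    (17094554151, 17094549912, 0),
]

rising = [ts for ts, _, data in messages if data == 1]
falling = [ts for ts, _, data in messages if data == 0]

periods = [b - a for a, b in zip(rising, rising[1:])]
widths = [f - r for r, f in zip(rising, falling)]
counter_steps = [b[1] - a[1] for a, b in zip(messages, messages[1:])]

print("pulse periods (mu):", sorted(set(periods)))
print("pulse widths (mu):", sorted(set(widths)))
print("rtio_counter step per event:", sorted(set(counter_steps)))
```

The period between rising edges is 65 in every case and the pulse width is 8, consistent with the 65mu figure and the `tp_mu = 8` used in the experiment.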

jordens added area:coredevice area:dma area:speed labels Mar 7, 2018

sbourdeauducq added the type:needs-funding label Mar 17, 2018

sbourdeauducq added a commit that referenced this issue Aug 9, 2018

test: skip test_dma_playback_time on Kasli (#946)

052e400

hartytp mentioned this issue Jan 29, 2020

DMA too slow for Fastino #1423

Closed

dhslichter mentioned this issue Jan 29, 2020

Memory bus width sinara-hw/Kasli-SOC#21

Closed

hartytp mentioned this issue Sep 21, 2020

DMA and wide RTIO issues #1521

Closed

sbourdeauducq mentioned this issue Nov 30, 2020

DMA: Increase buffer size #1552

Closed

sbourdeauducq removed the type:needs-funding label Aug 1, 2024

sbourdeauducq assigned occheung Aug 1, 2024

sbourdeauducq added this to the ARTIQ-9 milestone Aug 1, 2024

occheung linked a pull request Oct 10, 2024 that will close this issue

Implement Burst Memory Access for RTIO DMA & Analyzer #2592

Open

11 tasks
