dh219 edited this page May 18, 2021 · 10 revisions

Getting the DSP to work under Rev 3

The DSP proved to be the hardest on-board nut to crack. It's unusual among the Falcon's peripherals in that it has only an 8 bit bus. This means all bus cycles to it have to be handled differently by the 030, and the circuitry that does this on the Falcon's motherboard is separate from all the other bus cycle logic.

The bus (port) width & the control signals

The 32 bit Motorola chips (020 and above) can handle accessing peripherals with varying bus sizes. They can support an 8 bit bus (wired to the top 8 data lines), a 16 bit bus (again, wired to the top 16) and the full 32 bits.

This works because, on every bus cycle, the 030 is told how wide the responding device's data bus (or port, as the 030 documentation calls it) is, via the three cycle termination control lines: STERM, DSACK1 and DSACK0.

STERM always means a 32 bit port. This is never used when accessing the Falcon's motherboard as physically only 16 pins are connected.

DSACK[1:0], however, can indicate three different port sizes. Both lines are active low, so as pin levels 00 means 32 bit (again, not used), 01 means 16 bit and, crucially, 10 means 8 bit (11 is the inactive state).

The Falcon doesn't give us DSACK[1:0], however. Instead it gives us an ST-style DTACK signal. We therefore have to identify accesses to the DSP and translate that DTACK signal into DSACK = 10. All other accesses to the ST bus are 16 bit, so we normally translate it to 01.
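The encoding and the translation described above can be sketched as a small truth table. This is only an illustration in Python, not the DFB1 logic itself: the names `PORT_WIDTH_BY_DSACK`, `translate_dtack` and the `is_dsp_access` flag (standing in for whatever address decode identifies the DSP) are hypothetical, and all values are pin levels with 0 meaning low (asserted).

```python
from typing import Dict, Optional, Tuple

# Pin levels on the wire: 0 = low (asserted), 1 = high (negated).
# Keys are (DSACK1, DSACK0).
PORT_WIDTH_BY_DSACK: Dict[Tuple[int, int], Optional[int]] = {
    (0, 0): 32,    # both asserted: 32 bit port (not used on the Falcon bus)
    (0, 1): 16,    # only DSACK1 asserted: 16 bit port
    (1, 0): 8,     # only DSACK0 asserted: 8 bit port
    (1, 1): None,  # neither asserted: inactive, cycle not terminated yet
}


def translate_dtack(dtack: int, is_dsp_access: bool) -> Tuple[int, int]:
    """Map the active-low ST-style DTACK level to (DSACK1, DSACK0) levels.

    is_dsp_access is a hypothetical flag for "this cycle addresses the DSP".
    """
    if dtack == 1:
        # DTACK not asserted: leave both DSACK lines inactive.
        return (1, 1)
    # DTACK asserted: report an 8 bit port for the DSP, 16 bit otherwise.
    return (1, 0) if is_dsp_access else (0, 1)


if __name__ == "__main__":
    for dsp in (False, True):
        levels = translate_dtack(0, dsp)
        print("DSP access" if dsp else "ST bus access",
              "-> DSACK[1:0] =", levels,
              "->", PORT_WIDTH_BY_DSACK[levels], "bit port")
```

Keeping the DSP decision as a single flag mirrors the text: the only thing that changes between a DSP access and any other ST bus access is which of the two DSACK lines gets pulled low.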

At all times we're speaking about the width of the physical port: the number of data lines connected between the 030 and whatever it's speaking to. This is not the same as byte, word or longword transfers -- they are handled in as many cycles as are needed and are not related to port width.
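To make the distinction concrete, here is a minimal sketch of how many bus cycles a transfer needs through a given port. It assumes the transfer is aligned so that it never straddles a port boundary (misaligned transfers need extra cycles and are not modelled), and `bus_cycles` is a name made up for this example.

```python
def bus_cycles(transfer_bytes: int, port_bits: int) -> int:
    """Bus cycles needed for an aligned transfer through a port.

    transfer_bytes: 1 (byte), 2 (word) or 4 (longword).
    port_bits: physical port width, 8, 16 or 32.
    """
    if transfer_bytes not in (1, 2, 4):
        raise ValueError("transfer must be a byte, word or longword")
    if port_bits not in (8, 16, 32):
        raise ValueError("port must be 8, 16 or 32 bits wide")
    port_bytes = port_bits // 8
    # Ceiling division: a byte read through a 16 bit port still costs a cycle.
    return -(-transfer_bytes // port_bytes)


if __name__ == "__main__":
    for port in (8, 16, 32):
        print(f"longword through a {port} bit port:",
              bus_cycles(4, port), "cycle(s)")
```

A longword sent to the 8 bit DSP port therefore becomes four separate byte cycles, which is why the traces later on this page show four DSACK0 assertions per longword write.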

The data strobes and caching

OK, I'd figured out the port size issue with Rev 2, but I'd missed one little thing: the CPU cache. When using a 16 or 32 bit port, the CPU expects all bytes to be populated around the address it's requesting, even if they're beyond the scope of the request.

Materially that means, for a 16 bit port, you only need to worry about UDS and LDS differences when writing. When reading, you ask for both upper and lower bytes and the CPU reads both every time, even if it only really wants the upper (even) or lower (odd) byte, caching the other one.

That doesn't work with the DSP as it has neither upper nor lower bytes (it can only communicate one byte at a time!). The GALs translate the UDS and LDS signals to tell the DSP if A0 is odd or even, so it's important that only one of UDS and LDS is asserted when reading from the DSP. This applies equally to writes.

So, when accessing the DSP, the equation becomes very simple: UDS = A[0]; LDS = ~A[0];
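As a quick check of that equation, the sketch below evaluates it for an even and an odd address. It reads UDS and LDS as active-low pin levels (0 = asserted), which is the reading consistent with "upper = even, lower = odd" above; the function name `dsp_strobes` and the example addresses are arbitrary choices for illustration, not real DSP register addresses.

```python
from typing import Tuple


def dsp_strobes(address: int) -> Tuple[int, int]:
    """Return (UDS, LDS) pin levels for a DSP access; 0 = asserted (low)."""
    a0 = address & 1
    uds = a0       # UDS = A[0]:  low on even addresses (upper byte)
    lds = a0 ^ 1   # LDS = ~A[0]: low on odd addresses (lower byte)
    return uds, lds


if __name__ == "__main__":
    for addr in (0x1000, 0x1001):  # arbitrary even and odd addresses
        uds, lds = dsp_strobes(addr)
        print(f"address {addr:#06x}: UDS={uds} LDS={lds}")
```

Whatever the address, exactly one of the two strobes is low, which is the property the GALs rely on to work out A0.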

XDTACK timing

The biggest issue, however, was caused by the GAL-generated XDTACK line.

As explained above, we don't have DSACK[1:0] passed through to us on the expansion port. Instead we get a 68000-like DTACK which is synthesised by the Falcon motherboard.

When accessing all the other peripherals and memory on the Falcon's bus, which is 16 bit, the COMBEL chip generates the DTACK signal which, in turn, gets translated to the (onboard processor's) DSACK1 signal. We do the same: when we see XDTACK, we assert (our) DSACK1. Data strobe acknowledge, 16 bit port.

The DSP's 8 bit bus is handled differently, however. COMBEL doesn't generate DTACK; instead the GALs synthesise a custom DSACK0 (data strobe acknowledge, 8 bit port). This is also passed through to us on the expansion port's DTACK line. We know we're addressing the DSP, so we can assert our own DSACK0 instead of DSACK1.

Shouldn't be a problem?

Unfortunately, the logic we have in place to interface with the COMBEL turned out not to be compatible with the GAL-defined way of generating DSACK signals.

Here's how the GAL-generated DSACK0 signal looks on a stock Falcon when a longword is written to the DSP port. It's clean and regular.

[Figure: DSACK0 on stock Falcon]

Here's how it looked with DFB1r3, with all the data strobe and DSACK translations (above) in place.

[Figure: DSACK0 as generated when being accessed by DFB1r3]

You'll notice the first assertion (the low part) is much longer than in the stock version. Then there's a very short assertion only just after it's gone high. It goes high again immediately, then comes back with the (probably genuine) assertion an even shorter time later. This repeats for the next three bytes in the four-byte write. The overall length of the four-byte cycle is longer than on stock by the length of time the first cycle is elongated.

Ultimately, this is the problem. Because the clock DFB1r3 runs from is necessarily slightly delayed relative to the real Falcon clock, our cycle starts a little too late. The first DSACK0 is therefore asserted slightly late, and the cycle takes another clock period before the deassertion happens.

The 030 is ready to start the next cycle, however, and issues its address and data strobes. Unfortunately, because the GALs base their logic on a one-clock-delayed version of the strobe signal -- which at this point is still asserted -- DSACK0 is issued immediately and lasts only until the next clock cycle starts, which truncates it almost at once.

A short time later, the 'correct' time for issuing DSACK0 comes around and the line goes low again. Even this is truncated, however, as by now our 030 has completed its bus cycle, reacting to the short runt assertion first seen. Strobes are released early, meaning DSACK0 goes high again -- too soon.

The cycle now starts again. Strobes are issued, but one (onboard) cycle ago, they were issued as well -- the GAL thinks we've had two cycles of strobe assert! We see another runt DSACK0... and so on.

Ultimately, this is a problem caused by clock skew. Because of the way DFB1r3 generates the CPU clock, we're often a little late starting. COMBEL deals with this without problem, but the GAL-derived DTACK needs more work.

The solution

Two solutions present themselves:

  • Mask the DTACK signal so that the DFB1 030 does not see the runt signal.
  • Delay the 030 assertion of the AS and (U/L)DS strobes to the start of the next 16MHz clock cycle so that the GALs do not issue the runt signal.

The first is probably the more technically elegant and would likely lead to the least latency.

The second, however, is simpler to implement in the DFB1 logic, so that is what I have chosen.
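The timing effect of the second option can be sketched as simple arithmetic: whenever the 030 wants to assert its strobes, the assertion is held back until the next cycle of the 16MHz clock begins. This is a software illustration only, not the DFB1 logic. It assumes a nominal 62.5 ns period, that time zero coincides with the start of a clock cycle, and that a request landing exactly on a cycle start waits for the following one; `delayed_strobe_time` and `CLOCK_PERIOD_NS` are hypothetical names.

```python
import math

# Nominal period of a 16 MHz clock in nanoseconds (assumed for illustration).
CLOCK_PERIOD_NS = 62.5


def delayed_strobe_time(request_ns: float,
                        period_ns: float = CLOCK_PERIOD_NS) -> float:
    """Time at which the strobes are actually asserted.

    request_ns: when the 030 would like to assert AS and (U/L)DS.
    Returns the start of the next clock cycle strictly after request_ns,
    assuming cycle starts fall on whole multiples of period_ns.
    """
    cycles_elapsed = math.floor(request_ns / period_ns)
    return (cycles_elapsed + 1) * period_ns


if __name__ == "__main__":
    for t in (10.0, 62.5, 70.0):
        print(f"requested at {t} ns -> asserted at "
              f"{delayed_strobe_time(t)} ns")
```

Because the strobes now always line up with the start of a clock cycle, the GALs' one-clock-delayed copy of the strobe has been released by the time the next byte cycle begins. The price is the waiting time before each assertion.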

This does mean that writing four bytes now takes two (16MHz) cycles longer than stock. This could be a problem for high-data-rate DSP programs when not running in 50MHz mode (which would likely make up for it), but until such problems come to light, I am happy to report the DSP is now working with DFB1.