DIOT #84

Open
jordens opened this issue Dec 18, 2023 · 8 comments

@jordens
Member

jordens commented Dec 18, 2023

Since the topic comes up frequently and I couldn't find a place to tack this onto, here is an unsorted collection of things we noticed while considering moving away from EEM towards DIOT crates.

Problems we're facing with the current legacy Sinara EEM style:

  • Adding/removing/swapping cards requires disassembly of crate (hassle, risk, cost, usability limitation, and frustration for users). The ribbon cables don't have proper cable management and with more than a few it becomes dangerous/impossible to pull out and insert cards.
  • No hot-plugging support: yet hot-plugging happens by accident and is disastrous
  • Ribbon cables/IDC connectors have ill-defined impedance and lots of crosstalk; layered ribbon cables on Kasli defeat the G+-G shielding topology; bad SI
  • Ribbon cables block airflow
  • Ribbon cables require large force to install/remove and cannot easily be inserted/removed with modules in the crate
  • IDC connectors have limited durability
  • A uniform standard ribbon cable length is too long for nearby modules (requiring cable folding, increasing crosstalk, blocking airflow) but too short for cards far away (leading to tensioning)
  • IDC connectors are physically large and occupy lots of board area
  • EEM 8/9 are offset from EEM 0-7, 10, 11, leading to folding and uneven strain
  • IDC strain relief can't be used on 4 HP modules: fragile
  • A single ribbon cable over one non-strain-relief IDC connector is the limit for a 4 HP width
  • Ribbon cables risk being scratched/worn by adjacent PCB (during installation and insertion/removal attempts)
  • Need to tie down ribbon cables with cable ties (ties need to be redone on each card change)
  • Standard ribbon cables are manufactured with both connectors pointing the same direction, not opposite as EEM requires: leads to manual and error-prone re-assembly of the ribbon cable with one side strain-relief to flip orientation
  • Very limited power through barrel connector and ribbon cables
  • Unspecified power envelope
  • Power through the front panel requires elaborate EMI isolation and chokes at each egress/ingress
  • Ambiguous flexibility: unclear whether unused barrel connections can be used to power additional devices
  • No slot monitoring of power, presence, health, identification
  • Immediate power to all modules is dangerous to FPGA pins, violating sequencing
  • No power switch: re-plugging barrel connector to cycle is risky, causes surge and wear
  • Lack of power sequencing causes surge
  • Barrel connectors easily come undone, screw connectors aren't standard
  • Powering flow is error-prone/ambiguous/dangerous: see Clocker/Kasli/Stabilizer/Banker/FastServo/HVAMP with PoE plus front and rear barrel/molex connector plus EEM power in all possible configurations and corner-cases, it's unclear where to power a crate
  • No well-defined clock distribution
  • MMCX only supplies 4 modules
  • Clocking requires MMCX cables/tie-downs/insertion/removal hassle, connector wear, and risk of connector damage
  • MMCX connector positions (Kasli/Kasli-SoC) are awkward but no space elsewhere
  • Since well-defined clock distribution cannot be relied on, reference clocking options become an afterthought (Pounder, Fast Servo etc)
  • Standard MMCX cables are manufactured to have both connectors pointing the same direction, not opposite as EEM requires: leads to strain and twist of the MMCX cable
  • Kasli-SoC, Fast-Servo can't insert/swap SD card without disassembly of crate
  • 4 HP is too small for FPGA fans (Kasli, Kasli-SoC, Phaser) and too small for 8 SMA
  • Only 8 LVDS pairs per EEM connector leading to undesirable trade-offs (Urukul, Sampler, Grabber)
  • Connecting an unused EEM cable (EEM1 on Sampler) often leads to unexpected errors (LVDS FS)
  • Renumbering to increase/decrease EEM count on a module usually requires complete renumbering/reassembly/replugging of all modules
  • One-EEM vs two-EEM vs three-EEM is error-prone (DIP switches, LVDS fail safe), ground/power loops, unclear module identification and powering, choice overwhelms user
  • Mezzanines (Stabilizer, Mirny etc) are fragile, hard to debug, and electrically inferior
  • Many modules require forced-air cooling (Kasli, Kasli-SoC, Phaser, HVAMP32, HVAMP-8CH etc), but the crate doesn't specify or provide any, leading to ad-hoc inferior solutions (small fans are noisy, fragile, unmonitored, uncontrolled, unreliable, and have little mechanical compatibility for mounting); horizontal-flow fans are against the crate design (cf. Kasli and Phaser fan/thermal design issues); an adjacent 4 HP module blocks fan airflow; fans are incompatible with mezzanine presence
  • Proper convective cooling or fans require spacing out the modules, complicating crate assembly and disassembly/module changes: requires additional/moved rails, additional blank front panels (e.g. 4 HP Phaser is useless in practice), fan mgmt, fan EMI handling
  • Space inefficient: due to cooling requirements, a crate with one Kasli and several modules is often not full (space-wise), but rarely empty enough to insert another Kasli with modules
  • Convective cooling holes in crate habitually blocked in rack installation
  • Mezzanine electrical and mechanical design is poor (no IDC spacer with correct length, nuts on carrier underside thicker than allowed, no reliable IDC connector alignment, IDC has bad SI, electrical interface undefined)
  • Mezzanine usage often requires different front panel (Almazny), removing some of the mezzanine flexibility advantage, leading to discarded front panels
  • JTAG/SWD connectors in varying positions with reduced accessibility (Kasli, Stabilizer) requiring crate disassembly
  • Use case prioritization is unclear and leads to inefficient designs/prioritization: e.g. Stabilizer-Kasli/Urukul EEM connectivity vs networked autonomy
  • Dumb modules (Clocker, Stabilizer, FastServo, HVAMP-8CH, Thermostat-EEM, ...) waste an EEM slot+cable just to be powered
  • EEM was not designed to last. It was a convenient quick fix when we needed something working quickly. It has zero support or synergy outside Sinara.
  • No certification (CE etc)

DIOT downsides:

  • More expensive (backplane, connectors, receptacles vs IDC cables): would conceivably increase initial cost of a typical crate by 10% (? TBC)
  • Only 8 peripherals per crate (compared to max 12 in theory per Kasli with EEM)
  • Fixed board length (220 mm): no short DIO-BNC/SMA boards
  • Fixed board width (6 HP): no 8x BNC DIO or Sampler-BNC, no 4+4 carrier+mezzanine
  • Debugging/deployment/direct JTAG-flashing a card requires an adapter/extender (Kasli, Kasli-SoC)
  • Pounder/Driver mezzanines would need to become (a) RTMs, or (b) be consolidated with Stabilizer or (c) made thinner to fit into 6 HP (like DIOT/cPCI-S/FMC mezzanines): otherwise waste a slot
  • Thermostat controlling Zotino DAC would need to be an RTM (cabling), or become a thinner mezzanine to fit 6 HP, or be merged
  • Almazny redesign to fit 6 HP
  • Clocker reevaluation to (a) use empty space in crate, (b) merge into Kasli, (c) make more powerful to sensibly occupy a slot, (d) become stand-alone
  • HVAMP32/8 would need to be (a) an RTM, (b) external, or (c) consolidated with their driver, or (d) CPCIs/DIOT/FMC style mezzanine
  • IDC-SMA/BNC/MCX would need to become RTMs or external VHDCI/HD68 boards (otherwise waste slots)
  • Ad-hoc "mods" (e.g. Urukul/Mirny tunable VCO as PLLs) require more planning
  • Provokes an instinctive NIH+FUD reaction

The downsides drive and reinforce the need for:

  • External break-outs (due to connector density), see existing external breakout architectures: counter fixed slot positions/reduce slot waste
  • High-density digital IO from a single peripheral slot to replace the various 8x/16x DIO boards (see Banker, VHDCI): counter peripheral count decrease
  • Consolidation of mezzanines onto their carrier (consolidated hardware Stabilizer+Pounder Pounder#112, Almazny etc), or redesign mezzanine architecture (more FMC like), or into RTMs, or into CPCIs/DIOT-style mezzanines: to counter fixed slot positions/reduce slot waste
  • Well-defined cooling architecture/requirements
  • Well-defined identification/flashing/JTAG architecture
  • Well-defined powering/sequencing/health and status monitoring architecture
@marmeladapk
Member

> Debugging/deployment/direct JTAG-flashing a card requires an adapter/extender (Kasli, Kasli-SoC)

Deployment and flashing can be done by a controller when it asserts SERVMOD for a peripheral slot. Debugging is trickier; perhaps it could be done with BSCANE2 and routing of peripheral JTAG signals inside the FPGA?

@gkasprow
Member

@jordens do you think it would be valuable to equip each board with an I2C power monitor? Or should we just read the load from the power supply?

@jordens
Member Author

jordens commented Dec 18, 2023

For development, deployment, debugging etc maybe incremental total supply load is sufficient to infer per-board data.
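Inferring per-board data from the total supply load amounts to differencing consecutive readings as boards come up one by one. A minimal Python sketch, assuming boards are enabled sequentially and the load settles before each reading; the function name and the readings are hypothetical and not part of any DIOT or Sinara API:

```python
def per_board_power(readings_w):
    """Infer per-board power from cumulative total-supply readings.

    Assumes boards are enabled one at a time and the load settles before
    each reading. readings_w[0] is the baseline with no peripherals enabled;
    readings_w[i] is the total supply load after the i-th board came up.
    Returns one inferred value per board, in enable order.
    """
    if len(readings_w) < 2:
        raise ValueError("need a baseline plus at least one reading")
    # Each board's share is the step it added to the total load.
    return [after - before for before, after in zip(readings_w, readings_w[1:])]


if __name__ == "__main__":
    # Hypothetical readings in watts: baseline, then after each of three boards.
    total_load_w = [18.0, 30.5, 37.0, 52.0]
    print(per_board_power(total_load_w))  # [12.5, 6.5, 15.0]
```

This only works if boards are switched on in a known order, one at a time, and the supply's measurement resolution is fine enough to resolve a single board's step.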

@gkasprow
Member

But this assumes a per-board power on mechanism controlled by Kasli. Currently the power on moment is dependent on the slot number.

@gkasprow
Member

> More expensive (backplane, connectors, receptacles vs IDC cables): would conceivably increase initial cost of a typical crate by 10% (? TBC)

This needs confirmation, because one has to take into account the existing overhead of wiring and debugging EEM.

> Only 8 peripherals per crate (compared to max 12 in theory per Kasli with EEM)

If we develop consolidated HW that doesn't waste slots, that shouldn't be an issue. We can also use higher-density, 16-channel DIOs.

> Fixed board length (220 mm): no short DIO-BNC/SMA boards

The only difference is board cost, but that's negligible in most cases.

> Fixed board width (6 HP): no 8x BNC DIO or Sampler-BNC, no 4+4 carrier+mezzanine

The only downside is the lack of support for 8 BNCs on the front panel. This can be fixed by adapters.
In most cases, the mezzanine can be fitted into a 6 HP area by using lower-profile connectors.

> Debugging/deployment/direct JTAG-flashing a card requires an adapter/extender (Kasli, Kasli-SoC)

As Pawel mentioned, that can be solved using the DIOT SERVMOD mechanism. We also developed DIOT debug adapters/riser cards.

> Pounder/Driver mezzanines would need to become (a) RTMs, or (b) be consolidated with Stabilizer or (c) made thinner to fit into 6 HP (like DIOT/cPCI-S/FMC mezzanines): otherwise waste a slot

I wouldn't go for RTMs. 6 HP is quite a lot of space. I think we will go for Kirdy instead of the Driver; in the case of Pounder, we can replace the connectors and it will fit. Are there plans to support Pounder in DIOT?

> Thermostat controlling Zotino DAC would need to be an RTM (cabling), or become a thinner mezzanine to fit 6 HP, or be merged

We have more panel area, so we can expose the TEC connector.

> Almazny redesign to fit 6 HP

Just replace the SMA connectors with edge-mounted ones.

> Clocker reevaluation to (a) use empty space in crate, (b) merge into Kasli, (c) make more powerful to sensibly occupy a slot, (d) become stand-alone

We already integrated it with the CERN DIOT System Board; the new Kasli will take the same approach.

> HVAMP32/8 would need to be (a) an RTM, (b) external, or (c) consolidated with their driver, or (d) CPCIs/DIOT/FMC style mezzanine

We have more board area, so we can add a DAC to the board. We also have a bigger panel, so we can expose the input connector on it.

> IDC-SMA/BNC/MCX would need to become RTMs or external VHDCI/HD68 boards (otherwise waste slots)

We already have an HD68 breakout board.

> ad-hoc "mods" (e.g. Urukul/Mirny tunable VCO as PLLs) require more planning
> provokes instinctive NIH+FUD reaction

Just expose the VCO tuning connector on the panel using an SMA pigtail.

> The downsides drive and reinforce the need for:
>
> External break-outs (due to connector density), see [existing](https://github.com/sinara-hw/HD68_BNC_Breakout/wiki) [external breakout](https://github.com/sinara-hw/Banker/wiki) architectures: counter fixed slot positions/reduce slot waste
> High-density digital IO from a single peripheral slot to replace the various 8x/16x DIO boards (see Banker, VHDCI): counter peripheral count decrease

As I mentioned above, we have isolated/non-isolated MCX. We can also make an HD68 version of Banker.

> Consolidation of mezzanines onto their carrier (sinara-hw/Pounder#112, Almazny etc), or redesign mezzanine architecture (more FMC like), or into RTMs, or into CPCIs/DIOT-style mezzanines: to counter fixed slot positions/reduce slot waste

All the mezzanines share the same design flaw: the ground loop. Consolidation is the natural step, especially when we can skip the Ethernet/PoE and free some board area. What's the advantage of using Ethernet in DIOT?

> Well-defined cooling architecture/requirements

This was already decided and is part of the DIOT fan tray.

> Well-defined identification/flashing/JTAG architecture

FPGAs are already covered by remote JTAG (SERVMOD). The same mechanism can be used for microcontrollers.
This, of course, needs implementation in Kasli.

> Well-defined powering/sequencing/health and status monitoring architecture

The main question is whether the controller should interfere with existing DIOT power sequencing based on geographical addressing.

@jordens
Member Author

jordens commented Dec 21, 2023

@gkasprow I think you may have misread my list. I wanted to list the changes that are incurred by DIOT that still require varying levels of work to resolve.

> Debugging/deployment/direct JTAG-flashing a card requires an adapter/extender (Kasli, Kasli-SoC)

> As Pawel mentioned, that can be solved using the DIOT SERVMOD mechanism. We also developed DIOT debug adapters/riser cards

Let's see how that pans out. E.g. the most current DIOT spec I can find lists those as single-ended, while Pawel's KU peripheral does LVDS. I would also assume LVDS. But the entire machinery does add quite a bit of logic periphery on PBs and gateware in the SB. I'm aware of the adapters and risers.

> I wouldn't go for RTMs. 6HP is quite a lot of space; I think we will go for Kirdy instead of the Driver; in the case of Pounder, we can replace connectors and will fit. Are there plans to support Pounder in DIOT?

On the contrary, IMO assuming Stabilizer and mezzanines/Thermostat etc connect via EEM or DIOT appears to be more of an anti-feature. I listed several reasons above.

> We have more panel area, we can expose the TEC connector

IIRC given the tempco data, this also appears to be an anti-feature overall.

> As I mentioned above, we have isolated/non-isolated MCX.

I am fully aware. I was stressing that these are the important modules and the problematic ones need to be deprecated.

> We can also make HD68 version of Banker

How does that help?

> All the mezzanines share the same design flaw - the ground loop.

I'm not sure I understand exactly which ground loop you mean and where that matters (compared to e.g. the ground loop that already exists through panels/crate and backplane/ribbon cables).

> Consolidation is the natural step, especially when we can skip the Ethernet/PoE and free some board area. What's the advantage of using Ethernet in DIOT?

The initial idea was to use EEM/DIOT for synchronization and RT comms, Ethernet for wide bandwidth comms with non-RT stuff. But in the absence of a significant use-case and proper specs that connectivity will just linger around, diverge, and cost/confuse.

> This was already decided and is part of the DIOT fan tray

Fully aware. That's what's needed.

> Well-defined identification/flashing/JTAG architecture

> FPGAs are already covered by remote JTAG (SERVMOD). The same mechanism can be used for microcontrollers. This, of course, needs implementation in Kasli.

I'm not sure whether it's practical to do SWD over that channel the way we are used to doing it with a probe. IMO CPUs will either go stand-alone (thus making debugging easy), or we'll use a riser, or we'll just plug it into the crate with the debugger attached. Field deployment will go DFU over USB.

> Well-defined powering/sequencing/health and status monitoring architecture

> The main question is whether the controller should interfere with existing DIOT power sequencing based on geographical addressing.

I couldn't find a reference or description of this. How does that work?
Pawel's DIO MCX and KU PFC just directly connect to P12V0, and on the backplane P12V0 is all common and not sequenced.
In any case: if it's the peripheral's job to sequence its power based on GA delay, then it should also measure its own power (if desired).
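If sequencing is left to the peripherals, one simple form of "GA delay" is for each board to wait a fixed multiple of its geographical address after P12V0 appears before enabling its own rails. A minimal Python sketch under that assumption; the step and ramp times, the `Peripheral` class, and all function names are hypothetical and not taken from the DIOT specification:

```python
from dataclasses import dataclass

# Hypothetical parameters for illustration only; not from the DIOT spec.
STEP_MS = 50.0  # extra delay per geographical-address (GA) step
RAMP_MS = 20.0  # assumed duration of one board's inrush/rail ramp


@dataclass
class Peripheral:
    ga: int    # geographical address strapped on the backplane
    name: str


def power_on_delay_ms(ga: int, step_ms: float = STEP_MS) -> float:
    """Delay after P12V0 appears before this board enables its own rails."""
    if ga < 0:
        raise ValueError("geographical address must be non-negative")
    return ga * step_ms


def schedule(boards, step_ms=STEP_MS):
    """Return (start time in ms, name) pairs in power-on order."""
    return sorted((power_on_delay_ms(b.ga, step_ms), b.name) for b in boards)


def ramps_overlap(boards, step_ms=STEP_MS, ramp_ms=RAMP_MS) -> bool:
    """True if any two boards would ramp at the same time (inrush adds up)."""
    starts = [t for t, _ in schedule(boards, step_ms)]
    return any(later - earlier < ramp_ms for earlier, later in zip(starts, starts[1:]))


if __name__ == "__main__":
    # Eight peripheral slots, GA 1..8 (hypothetical numbering).
    crate = [Peripheral(ga, f"PB{ga}") for ga in range(1, 9)]
    for t, name in schedule(crate):
        print(f"{name}: enable rails at {t:.0f} ms")
    print("overlap:", ramps_overlap(crate))
```

Choosing the per-GA step larger than the worst-case ramp keeps inrush currents from stacking; no controller involvement is needed, which is the trade-off being discussed here.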

@gkasprow
Member

Within a few weeks, we will finish Sinara's R&D phase of the DI/OT transition.

We defined and implemented a simple sequencing mechanism (#91).

Most of the adapter boards were eliminated:

  • HVAMP32 is now integrated with Fastino EEM
  • HVAMP8 was integrated with Fastino
  • We packed 8 BNCs into a 6 HP panel, so Sampler-MCX becomes obsolete
  • We have 16-channel DIOs with SMA and MCX
  • Fastino/Zotino/Shuttler share the same high-performance mSAS connector, and we designed an mSAS fanout box

To make debugging of DIOT easier, we made extender boards.
We built Kasli-DIOT to eliminate the ugly adapter for the standard Kasli.
Kasli-SoC will be replaced by the CERN System Board. Thanks to CERN step pricing, it's much cheaper than the existing Kasli-SoC. We are working on porting ARTIQ.

To address the DIOT crate cost, we are designing a simplified power delivery; it will be a single board plugged directly into the backplane.
DI/OT mechanics will be significantly simplified. The DIOT crate will consist of three components (a standard Schroff chassis with fans, the DIOT backplane without RTM, and the power unit) plus extra screws for the backplane.

@sbourdeauducq
Member

> We are working on porting ARTIQ.

This is a substantial effort which will cost a lot of money and take a lot of time. As a reference, Zynq-7000 took about a year and cost about 150kEUR (including DRTIO, excluding maintenance/support/bugfixing and subsequent developments like DDMA/subkernels).

Overall I think it is worthwhile since it would bring performance improvements (faster 64-bit core), and pave the way for RFSoC support. But I'm not sure if we should depend on it for one of the most popular core devices - though of course we can still use the original Kasli-SoC and EEMs in the meantime.

I'm a little skeptical of the "CERN pricing" argument. The prices may not be that good in the first place (in general, hardware for scientists is very expensive), and even if they were, they may not be very reliable, nor accessible to other manufacturers who can drastically cut manufacturing costs in other areas.


4 participants