diff --git a/doc/smart-switch/high-availability/images/ha-bulk-sync-multichannel-ooo.svg b/doc/smart-switch/high-availability/images/ha-bulk-sync-multichannel-ooo.svg new file mode 100644 index 0000000000..7ef8f8fe97 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-bulk-sync-multichannel-ooo.svg @@ -0,0 +1,4 @@ + + + +
Active Control Plane
Active Control Plane
Active Data Plane
Active Data Plane
Standby Data Plane
Standby Data Plane
State Change
State Chan...
Bulk Sync Start
Bulk Sync...
Old
Old
New
New
ACK
ACK
Inline Sync
Inline Sync
New
New
Old
Old
Text is not SVG - cannot display
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-bulk-sync-multichannel-per-flow-version.svg b/doc/smart-switch/high-availability/images/ha-bulk-sync-multichannel-per-flow-version.svg new file mode 100644 index 0000000000..8a4be482e0 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-bulk-sync-multichannel-per-flow-version.svg @@ -0,0 +1,4 @@ + + + +
Active Control Plane
Active Control Plane
Active Data Plane
Active Data Plane
Standby Data Plane
Standby Data Plane
State Change
State Chan...
Bulk Sync Start
Bulk Sync...
Old, V: X
Old, V: X
New, V: X + 1
New, V: X + 1
ACK
ACK
Inline Sync
Inline Sync
New, V: X + 1
New, V: X + 1
New, V: X + 1
New, V: X + 1
Old, V: X
Old, V: X
Text is not SVG - cannot display
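The two bulk-sync diagrams above contrast plain multi-channel sync, where an out-of-order `Old` record arriving after an `Inline Sync` update can overwrite newer state, with the per-flow-version variant, where the standby keeps whichever record carries the higher version. A minimal sketch of that reconciliation rule follows; the class and method names are illustrative only, not part of the design:

```python
# Hypothetical sketch of per-flow version reconciliation on the standby DPU.
# FlowTable / apply_update are illustrative names, not from the HA design.

class FlowTable:
    def __init__(self):
        self._flows = {}  # flow_key -> (state, version)

    def apply_update(self, flow_key, state, version):
        """Apply a flow update only if it is not older than the stored one.

        Bulk sync may deliver a flow with the old version (V: X) after an
        inline sync already installed the new version (V: X + 1); comparing
        versions lets the standby drop the stale record instead of
        regressing to the old state.
        """
        current = self._flows.get(flow_key)
        if current is not None and version < current[1]:
            return False  # stale record from bulk sync, ignored
        self._flows[flow_key] = (state, version)
        return True

table = FlowTable()
assert table.apply_update("flow-1", "New", version=2)      # inline sync, V: X + 1
assert not table.apply_update("flow-1", "Old", version=1)  # late bulk sync, V: X
assert table._flows["flow-1"] == ("New", 2)
```

Without the version check (the first diagram), the late bulk-sync record would silently win and the standby would hold the stale flow state.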
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg b/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg index f22f8cb6a1..c15908a98c 100644 --- a/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg +++ b/doc/smart-switch/high-availability/images/ha-control-plane-overview.svg @@ -1,4 +1,4 @@ -
Upstream Service
Upstream Service
Smart Switch 2
Smart Switch 2
DPU Card 3
DPU Card 3
PCIe Bus
PCIe Bus
database
database
Redis
Redis
DPU ASIC / ARM Cores
DPU ASIC / ARM Cores
swss
swss
*orch*
*orch*
syncd
syncd
syncd
syncd
BP-Port
BP-Port
DPU Card 4
DPU Card 4
PCIe Bus
PCIe Bus
database
database
Redis
Redis
DPU ASIC / ARM Cores
DPU ASIC / ARM Cores
swss
swss
*orch*
*orch*
syncd
syncd
syncd
syncd
BP-Port
BP-Port
NPU
NPU
ha
ha
swbusd
swbusd
mgmt
mgmt
gNMI Agent
gNMI Agent
database
database
Redis
Redis
swss
swss
orchagent
orchagent
BP-Port
BP-Port
PCIe Bus
PCIe Bus
Syncd
Syncd
syncd
syncd
ASIC
ASIC
hamgrd
hamgrd
FP-Ports
FP-Ports
Mgmt Port
Mgmt Port
Smart Switch 1
Smart Switch 1
DPU Card 1
DPU Card 1
PCIe Bus
PCIe Bus
database
database
Redis
Redis
DPU ASIC / ARM Cores
DPU ASIC / ARM Cores
swss
swss
*orch*
*orch*
syncd
syncd
syncd
syncd
BP-Port
BP-Port
Mgmt Port
Mgmt Port
DPU Card 2
DPU Card 2
PCIe Bus
PCIe Bus
database
database
Redis
Redis
DPU ASIC / ARM Cores
DPU ASIC / ARM Cores
swss
swss
*orch*
*orch*
syncd
syncd
syncd
syncd
BP-Port
BP-Port
NPU
NPU
PCIe Bus
PCIe Bus
ha
ha
hamgrd
hamgrd
swbusd
swbusd
mgmt
mgmt
gNMI Agent
gNMI Agent
database
database
Redis
Redis
swss
swss
orchagent
orchagent
BP-Ports
BP-Ports
Syncd
Syncd
syncd
syncd
ASIC
ASIC
FP-Ports
FP-Ports
Legend
Legend
Upstream service channel
Upstream service channel
HA control plane data channel
HA control plane data channel
HA control plane control channel
HA control plane control channel
Internal redis/zmq based channel
Internal redis/zmq based channel
Data Plane Channel
Data Plane Channel
Text is not SVG - cannot display
\ No newline at end of file +
Smart Switch 0
Smart Switch 0
DPU Card 0
DPU Card 0
PCIe Bus
PCIe Bus
DPU ASIC / ARM Cores
DPU ASIC / ARM Cores
swss
swss
orchagent
orchagent
syncd
syncd
syncd
syncd
BP-Port
BP-Port
pmon
pmon
FP-Ports
FP-Ports
NPU
NPU
DPUx Control Plane
DPUx Control Plane
DPU0 Control Plane
DPU0 Control Plane
ha-dpu0
ha-dpu0
hamgrd
hamgrd
swbusd
swbusd
gnmi-dpu0
gnmi-dpu0
gNMI
gNMI
database-dpu0
database-dpu0
Redis
Redis
BP-Ports
BP-Ports
PCIe Bus
PCIe Bus
ASIC
ASIC
pmon
pmon
database
database
Redis
Redis
syncd
syncd
syncd
syncd
swss
swss
orchagent
orchagent
gnmi
gnmi
gNMI Splitter
gNMI Splitter
gNMI
gNMI
Mgmt Port
Mgmt Port
DPU Card 1
DPU Card 1
DPU Card N
DPU Card N
...
...
Upstream Service
Upstream Service
Smart Switch 0
Smart Switch 0
DPU Card 0
DPU Card 0
PCIe Bus
PCIe Bus
DPU ASIC / ARM Cores
DPU ASIC / ARM Cores
swss
swss
orchagent
orchagent
syncd
syncd
syncd
syncd
BP-Port
BP-Port
pmon
pmon
Mgmt Port
Mgmt Port
NPU
NPU
DPUx Control Plane
DPUx Control Plane
DPU0 Control Plane
DPU0 Control Plane
ha-dpu0
ha-dpu0
hamgrd
hamgrd
swbusd
swbusd
gnmi-dpu0
gnmi-dpu0
gNMI
gNMI
database-dpu0
database-dpu0
Redis
Redis
PCIe Bus
PCIe Bus
gnmi
gnmi
gNMI Splitter
gNMI Splitter
gNMI
gNMI
database
database
Redis
Redis
swss
swss
orchagent
orchagent
BP-Ports
BP-Ports
syncd
syncd
syncd
syncd
ASIC
ASIC
pmon
pmon
FP-Ports
FP-Ports
DPU Card 1
DPU Card 1
DPU Card N
DPU Card N
...
...
Legend
Legend
Upstream service channel
Upstream service channel
HA control plane data channel
HA control plane data channel
HA control plane control channel
HA control plane control channel
Internal redis/zmq based channel
Internal redis/zmq based channel
Data Plane Channel
Data Plane Channel
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/images/ha-scope-dpu-level.svg b/doc/smart-switch/high-availability/images/ha-scope-dpu-level.svg new file mode 100644 index 0000000000..10c9c96f56 --- /dev/null +++ b/doc/smart-switch/high-availability/images/ha-scope-dpu-level.svg @@ -0,0 +1,4 @@ + + + +
Smart Switch 0
NPU1
DPU1
(Standby)
ENI1
ENI2
Smart Switch 0
NPU1
DPU0
(Active)
ENI1
ENI2
All NPUs advertise the same VIP range to network.
NPU matches DST IP to find the active DPU and forward the packet.
\ No newline at end of file diff --git a/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md b/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md index d53ab84e6f..a74861b40b 100644 --- a/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md +++ b/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md @@ -3,53 +3,82 @@ | Rev | Date | Author | Change Description | | --- | ---- | ------ | ------------------ | | 0.1 | 10/14/2023 | Riff Jiang | Initial version | - -1. [1. Database Schema](#1-database-schema) - 1. [1.1. High level data flow](#11-high-level-data-flow) - 2. [1.2. NPU DB schema](#12-npu-db-schema) - 1. [1.2.1. CONFIG DB](#121-config-db) - 1. [1.2.1.1. DPU / vDPU definitions](#1211-dpu--vdpu-definitions) - 2. [1.2.1.2. HA global configurations](#1212-ha-global-configurations) - 2. [1.2.2. APPL DB](#122-appl-db) - 1. [1.2.2.1. HA set configurations](#1221-ha-set-configurations) - 2. [1.2.2.2. ENI placement configurations](#1222-eni-placement-configurations) - 3. [1.2.2.3. ENI configurations](#1223-eni-configurations) - 3. [1.2.3. State DB](#123-state-db) - 1. [1.2.3.1. DPU / vDPU state](#1231-dpu--vdpu-state) - 4. [1.2.4. DASH State DB](#124-dash-state-db) - 1. [1.2.4.1. DPU / vDPU HA states](#1241-dpu--vdpu-ha-states) - 2. [1.2.4.2. ENI state](#1242-eni-state) - 3. [1.3. DPU DB schema](#13-dpu-db-schema) - 1. [1.3.1. APPL DB](#131-appl-db) - 1. [1.3.1.1. HA set configurations](#1311-ha-set-configurations) - 2. [1.3.1.2. DASH ENI object table](#1312-dash-eni-object-table) - 2. [1.3.2. State DB](#132-state-db) - 1. [1.3.2.1. ENI HA state](#1321-eni-ha-state) -2. [2. Telemetry](#2-telemetry) - 1. [2.1. HA state](#21-ha-state) - 2. [2.2. HA operation counters](#22-ha-operation-counters) - 1. [2.2.1. hamgrd HA operation counters](#221-hamgrd-ha-operation-counters) - 2. [2.2.2. HA SAI API counters](#222-ha-sai-api-counters) - 3. [2.3. 
HA control plane communication channel related](#23-ha-control-plane-communication-channel-related) - 1. [2.3.1. HA control plane control channel counters](#231-ha-control-plane-control-channel-counters) - 2. [2.3.2. HA control plane data channel counters](#232-ha-control-plane-data-channel-counters) - 1. [2.3.2.1. Per bulk sync flow receive server counters](#2321-per-bulk-sync-flow-receive-server-counters) - 2. [2.3.2.2. Per ENI counters](#2322-per-eni-counters) - 4. [2.4. NPU-to-DPU tunnel related (NPU side)](#24-npu-to-dpu-tunnel-related-npu-side) - 1. [2.4.1. NPU-to-DPU probe status](#241-npu-to-dpu-probe-status) - 2. [2.4.2. NPU-to-DPU data plane state](#242-npu-to-dpu-data-plane-state) - 3. [2.4.3. NPU-to-DPU tunnel counters](#243-npu-to-dpu-tunnel-counters) - 5. [2.5. NPU-to-DPU tunnel related (DPU side)](#25-npu-to-dpu-tunnel-related-dpu-side) - 6. [2.6. DPU-to-DPU data plane channel related](#26-dpu-to-dpu-data-plane-channel-related) - 7. [2.7. DPU ENI pipeline related](#27-dpu-eni-pipeline-related) -3. [3. SAI APIs](#3-sai-apis) -4. [4. CLI commands](#4-cli-commands) - -## 1. Database Schema - -NOTE: Only the configuration that is related to HA is listed here and please check [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) to see other fields. - -### 1.1. High level data flow +| 0.2 | 02/12/2024 | Riff Jiang | Added more HA mode support; Updated DB schema and workflow to match recent database and PMON design. | +| 0.3 | 03/28/2024 | Riff Jiang | Updated telemetry. | +| 0.4 | 05/06/2024 | Riff Jiang | Added drop counters for pipeline monitoring. | +| 0.5 | 06/03/2024 | Riff Jiang | Added DASH BFD probe state update workflow and DB schema. | + +1. [1. High level data flow](#1-high-level-data-flow) + 1. [1.1. Upstream config programming path](#11-upstream-config-programming-path) + 2. [1.2. State generation and handling path](#12-state-generation-and-handling-path) +2. [2. Database Schema](#2-database-schema) + 1. 
[2.1. External facing configuration tables](#21-external-facing-configuration-tables) + 1. [2.1.1. CONFIG\_DB (per-NPU)](#211-config_db-per-npu) + 1. [2.1.1.1. DPU / vDPU definitions](#2111-dpu--vdpu-definitions) + 2. [2.1.1.2. HA global configurations](#2112-ha-global-configurations) + 2. [2.1.2. APPL\_DB (per-NPU)](#212-appl_db-per-npu) + 1. [2.1.2.1. HA set configurations](#2121-ha-set-configurations) + 2. [2.1.2.2. HA scope configurations](#2122-ha-scope-configurations) + 3. [2.1.2.3. ENI placement table (scope = `eni` only)](#2123-eni-placement-table-scope--eni-only) + 3. [2.1.3. DPU\_APPL\_DB (per-DPU)](#213-dpu_appl_db-per-dpu) + 1. [2.1.3.1. DASH object tables](#2131-dash-object-tables) + 2. [2.2. External facing state tables](#22-external-facing-state-tables) + 1. [2.2.1. STATE\_DB (per-NPU)](#221-state_db-per-npu) + 1. [2.2.1.1. HA scope state](#2211-ha-scope-state) + 1. [2.2.1.1.1. Table key](#22111-table-key) + 2. [2.2.1.1.2. Basic information](#22112-basic-information) + 3. [2.2.1.1.3. HA related states](#22113-ha-related-states) + 4. [2.2.1.1.4. Aggregated health signals for this HA scope](#22114-aggregated-health-signals-for-this-ha-scope) + 5. [2.2.1.1.5. Ongoing HA operation state](#22115-ongoing-ha-operation-state) + 3. [2.3. Tables used by HA internally](#23-tables-used-by-ha-internally) + 1. [2.3.1. DPU\_APPL\_DB (per-DPU)](#231-dpu_appl_db-per-dpu) + 1. [2.3.1.1. HA set configurations](#2311-ha-set-configurations) + 2. [2.3.1.2. HA scope configurations](#2312-ha-scope-configurations) + 3. [2.3.1.3. Flow sync sessions](#2313-flow-sync-sessions) + 2. [2.3.2. APPL\_DB (per-NPU)](#232-appl_db-per-npu) + 1. [2.3.2.1. DASH\_ENI\_FORWARD\_TABLE](#2321-dash_eni_forward_table) + 3. [2.3.3. CHASSIS\_STATE\_DB (per-NPU)](#233-chassis_state_db-per-npu) + 1. [2.3.3.1. DPU / vDPU state](#2331-dpu--vdpu-state) + 4. [2.3.4. DPU\_STATE\_DB (per-DPU)](#234-dpu_state_db-per-dpu) + 1. [2.3.4.1. HA set state](#2341-ha-set-state) + 2. [2.3.4.2. 
HA scope state](#2342-ha-scope-state) + 3. [2.3.4.3. Flow sync session states](#2343-flow-sync-session-states) + 4. [2.3.4.4. DASH BFD probe state](#2344-dash-bfd-probe-state) +3. [3. Telemetry](#3-telemetry) + 1. [3.1. HA state and related health signals](#31-ha-state-and-related-health-signals) + 2. [3.2. Traffic forwarding related](#32-traffic-forwarding-related) + 1. [3.2.1. NPU-to-DPU probe status (Per-HA Scope)](#321-npu-to-dpu-probe-status-per-ha-scope) + 2. [3.2.2. NPU-to-DPU tunnel counters (Per-HA Scope)](#322-npu-to-dpu-tunnel-counters-per-ha-scope) + 3. [3.3. DPU traffic handling related](#33-dpu-traffic-handling-related) + 1. [3.3.1. DPU level counters (Per-DPU)](#331-dpu-level-counters-per-dpu) + 2. [3.3.2. ENI-level traffic counters (Per-ENI)](#332-eni-level-traffic-counters-per-eni) + 3. [3.3.3. ENI-level pipeline drop counters (Per-ENI)](#333-eni-level-pipeline-drop-counters-per-eni) + 4. [3.3.4. ENI-level flow operation counters (Per-ENI)](#334-eni-level-flow-operation-counters-per-eni) + 4. [3.4. Flow sync counters](#34-flow-sync-counters) + 1. [3.4.1. Data plane channel probing (Per-HA Set)](#341-data-plane-channel-probing-per-ha-set) + 2. [3.4.2. Inline flow sync (Per-ENI)](#342-inline-flow-sync-per-eni) + 3. [3.4.3. Bulk sync related counters (Per-HA Set)](#343-bulk-sync-related-counters-per-ha-set) +4. [4. SAI APIs](#4-sai-apis) +5. [5. CLI commands](#5-cli-commands) + +## 1. High level data flow + +At a high level, SmartSwitch HA supports multiple modes: + +* **DPU-level passthru**: HA runs at the DPU level and is driven by the DPU itself. In this mode, the SmartSwitch HA control plane doesn't drive the HA state machine; it passes HA operations through to the DPU, handles the HA role change notifications from the DPU for state reporting, and helps set up the right traffic forwarding rules when needed. +* **DPU-level active-standby**: HA is handled at the DPU level. 
Unlike passthru mode, the SmartSwitch HA control plane drives the HA state machine, and moves all ENIs on the same DPU to the same HA state. The traffic forwarding rule is handled at the DPU level. +* **ENI-level active-standby**: The traffic forwarding rule and the HA state machine are handled at the ENI level. + +The mode is set on the HA set, and all modes share a similar high-level workflow, shown below. + +> Please note that: +> +> 1. The DPU DBs in the following graphs are actually placed on the NPU side, due to CPU and memory constraints on the DPU. Hence, the communication between `hamgrd` and the DPU DB is local and doesn't need to go through the PCIe bus. +> 2. Each DPU has its own hamgrd instance, so they can be updated independently from each other as part of DPU updates. + +### 1.1. Upstream config programming path + +The graph below shows how HA-related config is programmed into the NPU and DPU DBs from our upstream services, then handled by `hamgrd` and `swss`. ```mermaid flowchart LR @@ -59,56 +88,155 @@ flowchart LR subgraph NPU Components NPU_SWSS[swss] NPU_HAMGRD[hamgrd] + NPU_SYNCD[syncd] subgraph CONFIG DB - NPU_DPU[DPU_TABLE] - NPU_VDPU[VDPU_TABLE] - NPU_DASH_HA_GLOBAL_CONFIG[DASH_HA_GLOBAL_CONFIG] + subgraph All NPUs + NPU_DPU[DPU] + NPU_VDPU[VDPU] + NPU_DASH_HA_GLOBAL_CONFIG[DASH_HA_GLOBAL_CONFIG] + end end subgraph APPL DB + NPU_DASH_ENI_FORWARD_TABLE[DASH_ENI_FORWARD_TABLE] + NPU_VNET_ROUTE_TUNNEL_TABLE[VNET_ROUTE_TUNNEL_TABLE] + NPU_BFD_SESSION[BFD_SESSION_TABLE] - NPU_ACL_TABLE[APP_TABLE_TABLE] - NPU_ACL_RULE[APP_RULE_TABLE] + NPU_ROUTE[ROUTE_TABLE] + NPU_ACL_TABLE[ACL_TABLE_TABLE] + NPU_ACL_RULE[ACL_RULE_TABLE] + + subgraph All NPUs + NPU_DASH_HA_SET_CONFIG[DASH_HA_SET_CONFIG_TABLE] + NPU_DASH_ENI_PLACEMENT[DASH_ENI_PLACEMENT_TABLE] + end + + subgraph Owning NPU + NPU_DASH_HA_SCOPE_CONFIG[DASH_HA_SCOPE_CONFIG_TABLE] + end end + end + + subgraph "DPU0 Components (Same for other DPUs)" + DPU_SWSS[swss] + DPU_SYNCD[syncd] - subgraph DASH APPL DB 
- NPU_DASH_HA_SET[DASH_HA_SET_TABLE] - NPU_DASH_ENI_PLACEMENT[DASH_ENI_PLACEMENT_TABLE] - NPU_DASH_ENI_HA_CONFIG[DASH_ENI_HA_CONFIG_TABLE] + subgraph DPU APPL DB + DPU_DASH_ENI[DASH_ENI_TABLE] + DPU_DASH_HA_SET[DASH_HA_SET_TABLE] + DPU_DASH_HA_SCOPE_TABLE[DASH_HA_SCOPE_TABLE] + DPU_DASH_FLOW_SYNC_SESSION_TABLE[DASH_FLOW_SYNC_SESSION_TABLE] + DPU_BFD_SESSION_TABLE[BFD_SESSION_TABLE] end + end - subgraph STATE DB - NPU_DPU_STATE[DPU_TABLE] + %% Upstream services --> NPU northboard interfaces: + NC --> |gNMI| NPU_DPU + NC --> |gNMI| NPU_VDPU + NC --> |gNMI| NPU_DASH_HA_GLOBAL_CONFIG + + SC --> |gNMI - zmq| NPU_DASH_HA_SET_CONFIG + SC --> |gNMI - zmq| NPU_DASH_HA_SCOPE_CONFIG + SC --> |gNMI - zmq| NPU_DASH_ENI_PLACEMENT + + %% Upstream services --> DPU northboard interfaces: + SC --> |gNMI - zmq| DPU_DASH_ENI + + %% NPU tables --> NPU side SWSS: + NPU_DPU --> |SubscribeStateTable| NPU_SWSS + NPU_VDPU --> |SubscribeStateTable| NPU_SWSS + NPU_DASH_ENI_FORWARD_TABLE --> |zmq| NPU_SWSS + NPU_VNET_ROUTE_TUNNEL_TABLE --> |ConsumerStateTable| NPU_SWSS + NPU_BFD_SESSION --> |ConsumerStateTable| NPU_SWSS + NPU_ROUTE--> |ConsumerStateTable| NPU_SWSS + NPU_ACL_TABLE --> |ConsumerStateTable| NPU_SWSS + NPU_ACL_RULE --> |ConsumerStateTable| NPU_SWSS + + %% NPU side SWSS --> NPU tables: + NPU_SWSS --> |ProducerStateTable| NPU_BFD_SESSION + NPU_SWSS --> |ProducerStateTable| NPU_ROUTE + NPU_SWSS --> |ProducerStateTable| NPU_ACL_TABLE + NPU_SWSS --> |ProducerStateTable| NPU_ACL_RULE + NPU_SWSS --> NPU_SYNCD + + %% NPU tables --> hamgrd: + NPU_DPU --> |SubscribeStateTable| NPU_HAMGRD + NPU_VDPU --> |SubscribeStateTable| NPU_HAMGRD + NPU_DASH_HA_GLOBAL_CONFIG --> |SubscribeStateTable| NPU_HAMGRD + NPU_DASH_HA_SET_CONFIG --> |zmq| NPU_HAMGRD + NPU_DASH_ENI_PLACEMENT --> |zmq| NPU_HAMGRD + NPU_DASH_HA_SCOPE_CONFIG --> |zmq| NPU_HAMGRD + + %% hamgrd --> NPU tables: + NPU_HAMGRD --> |zmq| NPU_DASH_ENI_FORWARD_TABLE + NPU_HAMGRD --> |ProducerStateTable| NPU_VNET_ROUTE_TUNNEL_TABLE + + %% 
hamgrd --> DPU tables: + NPU_HAMGRD --> |zmq| DPU_DASH_HA_SET + NPU_HAMGRD --> |zmq| DPU_DASH_HA_SCOPE_TABLE + NPU_HAMGRD --> |zmq| DPU_DASH_FLOW_SYNC_SESSION_TABLE + NPU_HAMGRD --> |zmq| DPU_BFD_SESSION_TABLE + + %% DPU tables --> DPU SWSS: + DPU_DASH_ENI --> |zmq| DPU_SWSS + DPU_DASH_HA_SET --> |zmq| DPU_SWSS + DPU_DASH_HA_SCOPE_TABLE --> |zmq| DPU_SWSS + DPU_DASH_FLOW_SYNC_SESSION_TABLE --> |zmq| DPU_SWSS + DPU_BFD_SESSION_TABLE --> |zmq| DPU_SWSS + + %% DPU SWSS -> DPU syncd + DPU_SWSS --> DPU_SYNCD +``` + +### 1.2. State generation and handling path + +Whenever a device or data path state changes, we need to handle it and update the HA-related setup accordingly. + +The workflow below shows the high-level data flow for handling the state changes. There are 2 main paths: + +1. BFD probe state change path: When the BFD probe state changes, `swss` updates the route or ACL rules accordingly to point the traffic to the reachable DPU. +2. DPU state change path: When `pmon` reports device issues or we detect data path problems from counters, `hamgrd` updates the HA state machine and the traffic forwarding rules accordingly. 
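As an illustration of the BFD probe state change path, the next-hop set for the two vDPUs of a HA pair can be derived from the per-vDPU probe states, with `pinned_vdpu_bfd_probe_states` from `DASH_HA_SET_CONFIG_TABLE` overriding the measured values. The sketch below is only illustrative; the function name and the fall-back-to-both behavior when every probe is down are assumptions, not part of the design:

```python
# Illustrative next-hop selection from BFD probe states. The function and
# parameter names are hypothetical; only pinned_vdpu_bfd_probe_states
# mirrors a real field in DASH_HA_SET_CONFIG_TABLE.

def select_next_hops(probe_states, pinned_states=None):
    """Return the list of vDPU indexes that traffic may be forwarded to.

    probe_states:  measured BFD state per vDPU, "up" or "down".
    pinned_states: optional per-vDPU override, "" meaning no pin,
                   mirroring pinned_vdpu_bfd_probe_states.
    """
    effective = []
    for i, state in enumerate(probe_states):
        pinned = pinned_states[i] if pinned_states else ""
        effective.append(pinned if pinned else state)
    up = [i for i, s in enumerate(effective) if s == "up"]
    # Assumption: if every probe is down, keep forwarding to all vDPUs
    # rather than blackholing the traffic on the NPU.
    return up if up else list(range(len(probe_states)))

assert select_next_hops(["up", "up"]) == [0, 1]
assert select_next_hops(["up", "down"]) == [0]
assert select_next_hops(["down", "down"]) == [0, 1]
# A pinned "down" overrides a measured "up":
assert select_next_hops(["up", "up"], ["", "down"]) == [0]
```

`swss` would then translate the chosen index set into the corresponding route or ACL rule updates.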
+ +```mermaid +flowchart LR + SC[SDN Controllers] + + subgraph NPU Components + NPU_SWSS[swss] + NPU_HAMGRD[hamgrd] + NPU_PMON[pmon] + NPU_SYNCD[syncd] + + subgraph APPL DB + NPU_ACL_RULE[ACL_RULE_TABLE] + NPU_DASH_ENI_FORWARD_TABLE[DASH_ENI_FORWARD_TABLE] + NPU_VNET_ROUTE_TUNNEL_TABLE[VNET_ROUTE_TUNNEL_TABLE] + end + + subgraph CHASSIS STATE DB + NPU_DPU_STATE[DPU_STATE] NPU_VDPU_STATE[VDPU_TABLE] - NPU_BFD_SESSION_STATE[BFD_SESSION_TABLE] - NPU_ACL_TABLE_STATE[APP_TABLE_TABLE] - NPU_ACL_RULE_STATE[APP_RULE_TABLE] end - subgraph DASH STATE DB - NPU_DASH_DPU_HA_STATE[DASH_DPU_HA_STATE_TABLE] - NPU_DASH_VDPU_HA_STATE[DASH_VDPU_HA_STATE_TABLE] - NPU_DASH_ENI_HA_STATE[DASH_ENI_HA_STATE_TABLE] - NPU_DASH_ENI_DP_STATE[DASH_ENI_DP_STATE_TABLE] + subgraph STATE DB + NPU_BFD_SESSION_STATE[BFD_SESSION_TABLE] end - subgraph DASH COUNTER DB - NPU_DASH_COUNTERS[DASH_*_COUNTER_TABLE] + subgraph DASH STATE DB + NPU_DASH_HA_SCOPE_STATE[DASH_HA_SCOPE_STATE] end end subgraph "DPU0 Components (Same for other DPUs)" DPU_SWSS[swss] - - subgraph DASH APPL DB - DPU_DASH_HA_SET[DASH_HA_SET_TABLE] - DPU_DASH_ENI[DASH_ENI_TABLE] - DPU_DASH_ENI_HA_BULK_SYNC_SESSION[DASH_ENI_HA_BULK_SYNC_SESSION_TABLE] - end + DPU_SYNCD[syncd] subgraph DASH STATE DB - DPU_DASH_ENI_HA_STATE[DASH_ENI_HA_STATE_TABLE] + DPU_DASH_HA_SET_STATE[DASH_HA_SET_STATE] + DPU_DASH_HA_SCOPE_STATE[DASH_HA_SCOPE_STATE] + DPU_DASH_FLOW_SYNC_SESSION_STATE[DASH_FLOW_SYNC_SESSION_STATE] + DPU_DASH_BFD_PROBE_STATE[DASH_BFD_PROBE_STATE] end subgraph DASH COUNTER DB @@ -116,81 +244,74 @@ flowchart LR end end - %% Upstream services --> northbound interfaces: - NC --> |gNMI| NPU_DPU - NC --> |gNMI| NPU_VDPU - NC --> |gNMI| NPU_DASH_HA_GLOBAL_CONFIG + %% NPU side PMON --> NPU tables: + NPU_PMON --> NPU_DPU_STATE + NPU_PMON --> NPU_VDPU_STATE - SC --> |gNMI| NPU_DASH_HA_SET - SC --> |gNMI| NPU_DASH_ENI_PLACEMENT - SC --> |gNMI| NPU_DASH_ENI_HA_CONFIG - SC --> |gNMI| DPU_DASH_ENI + %% NPU tables --> hamgrd: + NPU_DPU_STATE --> |Direct 
Table Query| NPU_HAMGRD + NPU_VDPU_STATE --> |Direct Table Query| NPU_HAMGRD %% NPU tables --> NPU side SWSS: - NPU_DPU --> NPU_SWSS - NPU_VDPU --> NPU_SWSS - NPU_BFD_SESSION --> |ConsumerStateTable| NPU_SWSS - NPU_ACL_TABLE --> |ConsumerStateTable| NPU_SWSS + NPU_BFD_SESSION_STATE --> |ConsumerStateTable| NPU_SWSS NPU_ACL_RULE --> |ConsumerStateTable| NPU_SWSS + NPU_DASH_ENI_FORWARD_TABLE --> |zmq| NPU_SWSS + NPU_VNET_ROUTE_TUNNEL_TABLE --> |ConsumerStateTable| NPU_SWSS %% NPU side SWSS --> NPU tables: - NPU_SWSS --> NPU_DPU_STATE - NPU_SWSS --> NPU_VDPU_STATE - NPU_SWSS --> NPU_BFD_SESSION_STATE - NPU_SWSS --> NPU_ACL_TABLE_STATE - NPU_SWSS --> NPU_ACL_RULE_STATE - NPU_SWSS --> |Forward BFD Update| NPU_HAMGRD - - %% NPU tables --> hamgrd: - NPU_DPU --> NPU_HAMGRD - NPU_VDPU --> NPU_HAMGRD - NPU_DPU_STATE --> |Direct Table Query| NPU_HAMGRD - NPU_VDPU_STATE --> |Direct Table Query| NPU_HAMGRD - NPU_DASH_HA_GLOBAL_CONFIG --> NPU_HAMGRD - NPU_DASH_HA_SET --> NPU_HAMGRD - NPU_DASH_ENI_PLACEMENT --> NPU_HAMGRD - NPU_DASH_ENI_HA_CONFIG --> NPU_HAMGRD - NPU_DASH_COUNTERS --> |Direct Table Query| NPU_HAMGRD + NPU_SWSS --> NPU_SYNCD + NPU_SWSS --> |ProducerStateTable| NPU_BFD_SESSION_STATE + NPU_SWSS --> |ProducerStateTable| NPU_ACL_RULE %% DPU tables --> hamgrd: - DPU_DASH_ENI_HA_STATE --> NPU_HAMGRD + DPU_DASH_HA_SET_STATE --> |zmq| NPU_HAMGRD + DPU_DASH_HA_SCOPE_STATE --> |zmq| NPU_HAMGRD + DPU_DASH_FLOW_SYNC_SESSION_STATE --> |zmq| NPU_HAMGRD + DPU_DASH_BFD_PROBE_STATE --> |SubscribeStateTable| NPU_HAMGRD DPU_DASH_COUNTERS --> |Direct Table Query| NPU_HAMGRD %% hamgrd --> NPU tables: - NPU_HAMGRD --> NPU_DASH_DPU_HA_STATE - NPU_HAMGRD --> NPU_DASH_VDPU_HA_STATE - NPU_HAMGRD --> NPU_DASH_ENI_HA_STATE - NPU_HAMGRD --> NPU_DASH_ENI_DP_STATE - NPU_HAMGRD --> |ProducerStateTable| NPU_BFD_SESSION - NPU_HAMGRD --> |ProducerStateTable| NPU_ACL_TABLE - NPU_HAMGRD --> |ProducerStateTable| NPU_ACL_RULE + NPU_HAMGRD --> |zmq| NPU_DASH_ENI_FORWARD_TABLE + NPU_HAMGRD --> 
|ProducerStateTable| NPU_VNET_ROUTE_TUNNEL_TABLE + NPU_HAMGRD --> NPU_DASH_HA_SCOPE_STATE %% hamgrd --> DPU tables: - NPU_HAMGRD --> DPU_DASH_HA_SET - NPU_HAMGRD --> DPU_DASH_ENI_HA_BULK_SYNC_SESSION %% DPU tables --> DPU SWSS: - DPU_DASH_ENI --> DPU_SWSS - DPU_DASH_HA_SET --> DPU_SWSS - DPU_DASH_ENI_HA_BULK_SYNC_SESSION --> DPU_SWSS + DPU_SYNCD --> |SAI events
SAI callbacks| DPU_SWSS %% DPU swss --> DPU tables: - DPU_SWSS --> DPU_DASH_ENI_HA_STATE + DPU_SWSS --> |zmq| DPU_DASH_HA_SET_STATE + DPU_SWSS --> |zmq| DPU_DASH_HA_SCOPE_STATE + DPU_SWSS --> |zmq| DPU_DASH_FLOW_SYNC_SESSION_STATE + DPU_SWSS --> |Direct Write| DPU_DASH_BFD_PROBE_STATE DPU_SWSS --> DPU_DASH_COUNTERS + + %% NPU state tables --> Upstream service: + NPU_DASH_HA_SCOPE_STATE --> |gNMI| SC ``` -### 1.2. NPU DB schema +## 2. Database Schema -#### 1.2.1. CONFIG DB +> NOTE: +> +> * Only the configuration related to HA is listed here; please check the [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for the other fields. +> * Although there is a per-DPU database for each DPU, these databases run on the NPU side, not on the DPU side. The communication between `hamgrd` and the DPU DB is local and doesn't need to go through the PCIe bus. -##### 1.2.1.1. DPU / vDPU definitions +### 2.1. External facing configuration tables + +The following tables will be programmed either by the SDN controller or by the network controller to enable HA functionality in SmartSwitch. + +#### 2.1.1. CONFIG_DB (per-NPU) + +##### 2.1.1.1. DPU / vDPU definitions * These tables are imported from the SmartSwitch HLD to make the doc more convenient for reading, and we should always use that doc as the source of truth. * These tables should be prepopulated before any HA configuration tables below are programmed. | Table | Key | Field | Description | | --- | --- | --- | --- | -| DPU_TABLE | | | Physical DPU configuration. | +| DPU | | | Physical DPU configuration. | | | \ | | Physical DPU ID | | | | type | Type of DPU. It can be "local", "cluster" or "external". | | | | state | Admin state of the DPU device. | @@ -200,13 +321,13 @@ flowchart LR | | | npu_ipv4 | IPv4 address of its owning NPU loopback. | | | | npu_ipv6 | IPv6 address of its owning NPU loopback. | | | | probe_ip | Custom probe point if we prefer to use a different one from the DPU IP address. 
| -| VDPU_TABLE | | | Virtual DPU configuration. | +| VDPU | | | Virtual DPU configuration. | | | \ | | Virtual DPU ID | | | | profile | The profile of the vDPU. | | | | tier | The tier of the vDPU. | | | | main_dpu_ids | The IDs of the main physical DPU. | -##### 1.2.1.2. HA global configurations +##### 2.1.1.2. HA global configurations * The global configuration is shared by all HA sets, and ENIs and should be programmed on all switches. * The global configuration should be programmed before any HA set configurations below are programmed. @@ -214,36 +335,58 @@ flowchart LR | Table | Key | Field | Description | | --- | --- | --- | --- | | DASH_HA_GLOBAL_CONFIG | N/A | | HA global configurations. | -| | | dp_channel_dst_port | The destination port used when tunneling packetse via DPU-to-DPU data plane channel. | -| | | dp_channel_src_port_min | The min source port used when tunneling packetse via DPU-to-DPU data plane channel. | -| | | dp_channel_src_port_max | The max source port used when tunneling packetse via DPU-to-DPU data plane channel. | +| | | cp_data_channel_port | The port of control plane data channel, used for bulk sync. | +| | | dp_channel_dst_port | The destination port used when tunneling packets via DPU-to-DPU data plane channel. | +| | | dp_channel_src_port_min | The min source port used when tunneling packets via DPU-to-DPU data plane channel. | +| | | dp_channel_src_port_max | The max source port used when tunneling packets via DPU-to-DPU data plane channel. | | | | dp_channel_probe_interval_ms | The interval of sending each DPU-to-DPU data path probe. | -| | | dpu_bfd_probe_multiplier | The number of DPU BFD probe failure before probe down. | +| | | dp_channel_probe_fail_threshold | The number of probe failure needed to consider data plane channel is dead. | | | | dpu_bfd_probe_interval_in_ms | The interval of DPU BFD probe in milliseconds. | +| | | dpu_bfd_probe_multiplier | The number of DPU BFD probe failure before probe down. | -#### 1.2.2. 
APPL DB +#### 2.1.2. APPL_DB (per-NPU) -##### 1.2.2.1. HA set configurations +##### 2.1.2.1. HA set configurations -* The HA set table defines which DPUs should be forming the same HA set and how. -* The HA set table should be programmed on all switches, so we could program the ENI location information and setup the traffic forwarding rules. -* If the HA set contains local vDPU, it will be copied to DPU side DB by `hamgrd` as well. +* The HA set table defines the vDPUs are used in this HA set and its mode, such as HA owner and scope. +* The HA set table should be programmed on all switches, so we can use it to create the traffic forwarding rules on the NPU side. +* If any vDPU in the HA set is local, `hamgrd` will send the HA set information to DPU, so DPU can start pairing with its peer DPU and set up the data plane channel. | Table | Key | Field | Description | | --- | --- | --- | --- | -| DASH_HA_SET_TABLE | | | HA set table, which describes the DPUs that forms the HA set. | +| DASH_HA_SET_CONFIG_TABLE | | | HA set config table, which describes the DPUs that forms the HA set. | | | \ | | HA set ID | | | | version | Config version. | +| | | vip_v4 | IPv4 Data path VIP. | +| | | vip_v6 | IPv6 Data path VIP. | +| | | owner | Owner/Driver of HA state machine. It can be `dpu`, `switch`. | +| | | scope | HA scope. It can be `dpu`, `eni`. | | | | vdpu_ids | The ID of the vDPUs. | -| | | mode | Mode of HA set. It can be "activestandby". | -| | | pinned_vdpu_bfd_probe_states | Pinned probe states of vDPUs, connected by ",". Each state can be "" (none), "up" or "down". | -| | | preferred_standalone_vdpu_index | Preferred vDPU index to be standalone when entering into standalone setup. | +| | | pinned_vdpu_bfd_probe_states | Pinned probe states of vDPUs, connected by ",". Each state can be "" (none), `up` or `down`. | +| | | preferred_vdpu_id | When preferred vDPU ID is set, the traffic will be forwarded to this vDPU when both BFD probes are up. 
| +| | | preferred_standalone_vdpu_index | (scope = `eni` only)

Preferred vDPU index to be standalone when entering into standalone setup. | -##### 1.2.2.2. ENI placement configurations +##### 2.1.2.2. HA scope configurations +* The HA scope configuration table is programmed by SDN controller and contains the HA config for each HA scope (DPU or ENI) that only lands on this specific switch. +* When HA scope is set to `dpu` in HA set, SmartSwitch HA will use the HA set id as the HA scope id, otherwise, HA scope id will be the ENI id. + +| Table | Key | Field | Description | +| --- | --- | --- | --- | +| DASH_HA_SCOPE_CONFIG_TABLE | | | HA scope configuration. | +| | \ | | VDPU ID. | +| | \ | | HA scope ID. It can be the HA set id (scope = `dpu`) or ENI id (scope = `eni`) | +| | | version | Config version. | +| | | disabled | If true, disable this vDPU. It can only be `false` or `true`. | +| | | desired_ha_state | The desired state for this vDPU. It can only be "" (none), `dead`, `active` or `standalone`. | +| | | approved_pending_operation_ids | Approved pending HA operation id list, connected by "," | + +##### 2.1.2.3. ENI placement table (scope = `eni` only) + +* The ENI placement table is used when HA scope is set to `eni`. * The ENI placement table defines which HA set this ENI belongs to, and how to forward the traffic. * The ENI placement table should be programmed on all switches. -* Once this table is programmed, `hamgrd` will generate the BFD +* Once this table is programmed, `hamgrd` will generate the routing configurations to `swss` for enable ENI level forwarding. | Table | Key | Field | Description | | --- | --- | --- | --- | @@ -252,363 +395,390 @@ flowchart LR | | | version | Config version. | | | | eni_mac | ENI mac address. Used to create the NPU side ACL rules to match the incoming packets and forward to the right DPUs. | | | | ha_set_id | The HA set ID that this ENI is allocated to. | -| | | pinned_next_hop_index | The index of the pinned next hop DPU for this ENI forwarding rule. "" = Not set. 
| +| | | pinned_next_hop_index | The index of the pinned next hop DPU for this ENI traffic forwarding rule. "" = Not set. | + +#### 2.1.3. DPU_APPL_DB (per-DPU) -##### 1.2.2.3. ENI configurations +##### 2.1.3.1. DASH object tables -* The ENI HA configuration table contains the ENI-level HA config. -* The ENI HA configuraiton table only contains the ENIs that is hosted on the local switch. +* The DASH objects will only be programmed on the DPU that is hosting the ENIs. | Table | Key | Field | Description | | --- | --- | --- | --- | -| DASH_ENI_HA_CONFIG_TABLE | | | ENI HA configuration. | -| | \ | | vDPU ID. Used to identifying a single vDPU. | -| | \ | | ENI ID. Used for identifying a single ENI. | -| | | version | Config version. | -| | | desired_ha_state | The desired state for this ENI. It can only be "" (none), dead, active or standalone. | -| | | approved_pending_operation_request_id | Approved pending approval operation ID, e.g. switchover operation. | +| DASH_ENI_TABLE | | | HA configuration for each ENI. | +| | \ | | ENI ID. Used to identifying a single ENI. | +| | | admin_state | Admin state of each DASH ENI. To support control from HA, `STATE_HA_ENABLED` is added. | +| | | ha_scope_id | HA scope id. It can be the HA set id (scope = `dpu`) or ENI id (scope = `eni`) | +| | | ... | see [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. | -#### 1.2.3. State DB +### 2.2. External facing state tables -##### 1.2.3.1. DPU / vDPU state +#### 2.2.1. STATE_DB (per-NPU) -DPU/vDPU state table stores the health states of each DPU/vDPU. These data are collected by `pmon`. +##### 2.2.1.1. HA scope state -| Table | Key | Field | Description | -| --- | --- | --- | --- | -| DPU_TABLE | | | Physical DPU state. | -| | \ | | Physical DPU ID | -| | | health_state | Health state of the DPU device. It can be "healthy", "unhealthy". Only valid when the DPU is local. | -| | | ... 
| see [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. |
-| VDPU_TABLE | | | Virtual DPU state. |
-| | \ | | Virtual DPU ID |
-| | | health_state | Health state of the vDPU. It can be "healthy", "unhealthy". Only valid when the vDPU is local. |
-| | | ... | see [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. |
+To show the current state of HA, the states will be aggregated by `hamgrd` and stored in the HA scope state table below.

-#### 1.2.4. DASH State DB
+> Because this state table is relatively large, the fields are split into a few sections below.

-##### 1.2.4.1. DPU / vDPU HA states
+###### 2.2.1.1.1. Table key

| Table | Key | Field | Description |
| --- | --- | --- | --- |
-| DASH_HA_DPU_STATE_TABLE | | | HA related Physical DPU state. |
-| | \ | | Physical DPU ID |
-| | | card_level_probe_state | Card level probe state. It can be "unknown", "up", "down". |
-| DASH_HA_VDPU_STATE_TABLE | | | HA related Virtual DPU state. |
-| | \ | | Virtual DPU ID |
-| | | card_level_probe_state | Card level probe state. It can be "unknown", "up", "down". |
+| DASH_HA_SCOPE_STATE | | | The state of each HA scope (vDPU or ENI) that is hosted on the local switch. |
+| | \ | | VDPU ID. Used to identify a single VDPU. |
+| | \ | | HA scope ID. It can be the HA set id (scope = `dpu`) or ENI id (scope = `eni`) |

-##### 1.2.4.2. ENI state
+###### 2.2.1.1.2. Basic information

-On NPU side, the ENI state table shows:
+| Table | Key | Field | Description |
+| --- | --- | --- | --- |
+| | | creation_time_in_ms | HA scope creation time in milliseconds. |
+| | | last_heartbeat_time_in_ms | Last heartbeat time in milliseconds. This is used for leak detection. The heartbeat happens once per minute and will not change the last state updated time. |
+| | | vip_v4 | IPv4 data path VIP of the DPU or ENI. |
+| | | vip_v6 | IPv6 data path VIP of the DPU or ENI. 
| +| | | local_ip | The IP of local DPU. | +| | | peer_ip | The IP of peer DPU. | -* The HA state of each local ENI. -* The traffic forwarding state of all known ENIs. +###### 2.2.1.1.3. HA related states | Table | Key | Field | Description | | --- | --- | --- | --- | -| DASH_ENI_HA_STATE_TABLE | | | Data plane state of each ENI that is hosted on local switch. | -| | \ | | VDPU ID. Used to identifying a single VDPU. | -| | \ | | ENI ID. Used to identifying a single ENI. | -| | | creation_time_is_ms | ENI creation time in milliseconds. | -| | | last_heartbeat_time_in_ms | ENI last heartbeat time in milliseconds. Heartbeat time happens once per minute and will not change the last state updated time. | -| | | last_state_updated_time_in_ms | ENI state last updated time in milliseconds. | -| | | data_path_vip | Data path VIP of the ENI. | | | | local_ha_state | The state of the HA state machine. This is the state in NPU hamgrd. | -| | | local_ha_state_last_update_time_in_ms | The time when local target HA state is set. | -| | | local_ha_state_last_update_reason | The reason of the last HA state change. | +| | | local_ha_state_last_updated_time_in_ms | The time when local target HA state is set. | +| | | local_ha_state_last_updated_reason | The reason of the last HA state change. | | | | local_target_asic_ha_state | The target HA state in ASIC. This is the state that hamgrd generates and asking DPU to move to. | | | | local_acked_asic_ha_state | The HA state that ASIC acked. | | | | local_target_term | The current target term of the HA state machine. | | | | local_acked_term | The current term that acked by ASIC. | -| | | local_bulk_sync_recv_server_endpoints | The IP endpoints that used to receive flow records during bulk sync, connected by ",". | -| | | peer_ip | The IP of peer DPU. | | | | peer_ha_state | The state of the HA state machine in peer DPU. | | | | peer_term | The current term in peer DPU. 
| -| | | peer_bulk_sync_recv_server_endpoints | The IP endpoints that used to receive flow records during bulk sync, connected by ",". | -| | | ha_operation_type | HA operation type, e.g., "switchover". | -| | | ha_operation_id | HA operation ID (GUID). | -| | | ha_operation_state | HA operation state. It can be "created", "pendingapproval", "approved", "inprogress" | -| | | ha_operation_start_time_in_ms | The time when operation is created. | -| | | ha_operation_state_last_update_time_in_ms | The time when operation state is updated last time. | -| | | bulk_sync_start_time_in_ms | Bulk sync start time in milliseconds. | -| DASH_ENI_DP_STATE_TABLE | | | Data plane state of all known ENI. | -| | \ | | ENI ID. Used to identifying a single ENI. | -| | | ha_set_mode | HA set mode. See [HA set configurations](#1221-ha-set-configurations) for more details. | -| | | next_hops | All possible next hops for this ENI. | -| | | next_hops_types | Type of each next hops, connected by ",". | -| | | next_hops_card_level_probe_states | Card level probe state for each next hop, connected by ",". It can be "unknown", "up", "down". | -| | | next_hops_active_states | Is next hop set as active the ENI HA state machine. It can be "unknown", "true", "false". | -| | | next_hops_final_state | Final state for each next hops, connected by ",". It can be "up", "down". | -### 1.3. DPU DB schema +###### 2.2.1.1.4. Aggregated health signals for this HA scope -#### 1.3.1. APPL DB +| Table | Key | Field | Description | +| --- | --- | --- | --- | +| | | local_vdpu_midplane_state | The state of local vDPU midplane. The value can be "unknown", "up", "down". | +| | | local_vdpu_midplane_state_last_updated_time_in_ms | Local vDPU midplane state last updated time in milliseconds. | +| | | local_vdpu_control_plane_state | The state of local vDPU control plane, which includes DPU OS and certain required firmware. The value can be "unknown", "up", "down". 
| +| | | local_vdpu_control_plane_state_last_updated_time_in_ms | Local vDPU control plane state last updated time in milliseconds. | +| | | local_vdpu_data_plane_state | The state of local vDPU data plane, which includes DPU hardware / ASIC and certain required firmware. The value can be "unknown", "up", "down". | +| | | local_vdpu_data_plane_state_last_updated_time_in_ms | Local vDPU data plane state last updated time in milliseconds. | +| | | local_vdpu_up_bfd_sessions_v4 | The list of IPv4 peer IPs (NPU IP) of the BFD sessions in up state. | +| | | local_vdpu_up_bfd_sessions_v4_update_time_in_ms | Local vDPU BFD sessions v4 last updated time in milliseconds. | +| | | local_vdpu_up_bfd_sessions_v6 | The list of IPv6 peer IPs (NPU IP) of the BFD sessions in up state. | +| | | local_vdpu_up_bfd_sessions_v6_update_time_in_ms | Local vDPU BFD sessions v6 last updated time in milliseconds. | + +###### 2.2.1.1.5. Ongoing HA operation state + +| Table | Key | Field | Description | +| --- | --- | --- | --- | +| | | pending_operation_ids | GUIDs of pending operation IDs, connected by "," | +| | | pending_operation_types | Type of pending operations, e.g. "switchover", "activate_role", "flow_reconcile", "brainsplit_recover". Connected by "," | +| | | pending_operation_list_last_updated_time | Last updated time of the pending operation list. | +| | | switchover_id | Switchover ID (GUID). | +| | | switchover_state | Switchover state. It can be "pendingapproval", "approved", "inprogress", "completed", "failed" | +| | | switchover_start_time_in_ms | The time when operation is created. | +| | | switchover_end_time_in_ms | The time when operation is ended. | +| | | switchover_approved_time_in_ms | The time when operation is approved. | +| | | flow_sync_session_id | Flow sync session ID. | +| | | flow_sync_session_state | Flow sync session state. It can be "inprogress", "completed", "failed" | +| | | flow_sync_session_start_time_in_ms | Flow sync start time in milliseconds. 
| +| | | flow_sync_session_target_server | The IP endpoint of the server that flow records are sent to. | + +### 2.3. Tables used by HA internally -##### 1.3.1.1. HA set configurations +#### 2.3.1. DPU_APPL_DB (per-DPU) -If any HA set configuration is related to local DPU, it will be parsed and being programmed to the DPU side DB, which will be translated to SAI API calls and sent to ASIC by DPU side swss. +##### 2.3.1.1. HA set configurations + +When a HA set configuration on NPU side contains a local DPU, `hamgrd` will create the HA set configuration and send it to DPU for programming the ASIC by DPU side swss. | Table | Key | Field | Description | | --- | --- | --- | --- | | DASH_HA_SET_TABLE | | | HA set table, which describes the DPUs that forms the HA set. | | | \ | | HA set ID | | | | version | Config version. | -| | | mode | Mode of HA set. It can be "activestandby". | -| | | peer_dpu_ipv4 | The IPv4 address of peer DPU. | -| | | peer_dpu_ipv6 | The IPv6 address of peer DPU. | +| | | vip_v4 | IPv4 Data path VIP. | +| | | vip_v6 | IPv6 Data path VIP. | +| | | owner | Owner of HA state machine. It can be `controller`, `switch`. | +| | | scope | Scope of HA set. It can be `dpu`, `eni`. | +| | | local_npu_ip | The IP address of local NPU. It can be IPv4 or IPv6. Used for setting up the BFD session. | +| | | local_ip | The IP address of local DPU. It can be IPv4 or IPv6. | +| | | peer_ip | The IP address of peer DPU. It can be IPv4 or IPv6. | +| | | cp_data_channel_port | The port of control plane data channel, used for bulk sync. | | | | dp_channel_dst_port | The destination port used when tunneling packetse via DPU-to-DPU data plane channel. | | | | dp_channel_src_port_min | The min source port used when tunneling packetse via DPU-to-DPU data plane channel. | | | | dp_channel_src_port_max | The max source port used when tunneling packetse via DPU-to-DPU data plane channel. 
|
| | | dp_channel_probe_interval_ms | The interval of sending each DPU-to-DPU data path probe. |
+| | | dp_channel_probe_fail_threshold | The number of probe failures needed to consider the data plane channel dead. |

-##### 1.3.1.2. DASH ENI object table
+##### 2.3.1.2. HA scope configurations

-* The DASH objects will only be programmed on the DPU that is hosting the ENIs.
+| Table | Key | Field | Description |
+| --- | --- | --- | --- |
+| DASH_HA_SCOPE_TABLE | | | HA scope configuration. |
+| | \ | | HA scope ID. It can be the HA set id (scope = `dpu`) or ENI id (scope = `eni`) |
+| | | version | Config version. |
+| | | disabled | If true, disable this vDPU. It can only be `false` or `true`. |
+| | | ha_role | The HA role for this scope. It can only be `dead`, `active`, `standby`, `standalone`, `switching_to_active`. |
+| | | flow_reconcile_requested | If true, flow reconcile will be initiated. (Message Only. Not saved in DB.) |
+| | | activate_role_requested | If true, HA role will be activated. (Message Only. Not saved in DB.) |
+
+##### 2.3.1.3. Flow sync sessions

| Table | Key | Field | Description |
| --- | --- | --- | --- |
-| DASH_ENI_TABLE | | | HA configuration for each ENI. |
-| | \ | | ENI ID. Used to identifying a single ENI. |
+| DASH_FLOW_SYNC_SESSION_TABLE | | | |
+| | \ | | Flow sync session id. |
| | | ha_set_id | HA set id. |
-| | | ha_role | HA role. It can be "dead", "active", "standby", "standalone", "switching_to_active" |
-| | | ... | see [SONiC-DASH HLD](https://github.com/sonic-net/SONiC/blob/master/doc/dash/dash-sonic-hld.md) for more details. |
-| DASH_ENI_HA_BULK_SYNC_SESSION_TABLE | | | HA bulk sync session table. |
-| | \ | | ENI ID. Used to identifying a single ENI. |
-| | | session_id | Bulk sync session id. |
-| | | peer_bulk_sync_recv_server_endpoints | The IP endpoints that used to receive flow records during bulk sync, connected by ",". |
+| | | target_server_ip | The IP of the server that is used to receive flow records. 
|
+| | | target_server_port | The port of the server that is used to receive flow records. |
+
+#### 2.3.2. APPL_DB (per-NPU)
+
+##### 2.3.2.1. DASH_ENI_FORWARD_TABLE

-#### 1.3.2. State DB
+| Table | Key | Field | Description | Value Format |
+| --- | --- | --- | --- | --- |
+| DASH_ENI_FORWARD_TABLE | | | | |
+| | \ | | VNET name. Used to correlate the VNET table to find VNET info, such as advertised VIP. | {{vnet_name}} |
+| | \ | | ENI ID. Same as the MAC address of the ENI. | {{eni_id}} |
+| | | vdpu_ids | The list of vDPU IDs hosting this ENI. | {{vdpu_id1},{vdpu_id2},...} |
+| | | primary_vdpu | The primary vDPU id. | {{dpu_id}} |
+| | | outbound_vni | (Optional) Outbound VNI used by this ENI, if different from the one in VNET. Each ENI can have its own VNI, such as in the ExpressRoute Gateway Bypass case. | {{vni}} |
+| | | outbound_eni_mac_lookup | (Optional) Specify which MAC address to use to look up the ENI for the outbound traffic. | "" (default), "dst", "src" |

-##### 1.3.2.1. ENI HA state
+#### 2.3.3. CHASSIS_STATE_DB (per-NPU)

-* The ENI HA state table contains the ENI-level HA state.
-* The ENI HA state table only contains the ENIs that is hosted on the local DPU.
+##### 2.3.3.1. DPU / vDPU state
+
+The DPU state table stores the health states of each DPU. These data are collected by `pmon`.

| Table | Key | Field | Description |
| --- | --- | --- | --- |
-| DASH_ENI_HA_STATE_TABLE | | | HA state of each ENI that is hosted on local DPU. |
-| | \ | | ENI ID. Used to identifying a single ENI. |
-| | | ha_role | The current HA role confirmed by ASIC. It can be "dead", "active", "standby", "standalone", "switching_to_active" |
-| | | term | The current term confirmed by ASIC. |
-| | | ha_role_last_update_time | The time when HA role is last updated in milliseconds. |
-| | | bulk_sync_recv_server_endpoints | The IP endpoints that used to receive flow records during bulk sync, connected by ",". |
-| | | ongoing_bulk_sync_session_id | Ongoing bulk sync session id. 
| -| | | ongoing_bulk_sync_session_start_time_in_ms | Ongoing bulk sync session start time in milliseconds. | - -## 2. Telemetry +| DPU_STATE | | | Physical DPU state. | +| | \ | | Physical DPU ID | +| | | ... | see [SmartSwitch PMON HLD](https://github.com/sonic-net/SONiC/pull/1584) for more details. | +| VDPU_STATE | | | vDPU state. | +| | \ | | vDPU ID | +| | | ... | Placeholder for future vDPU usage. It should follow the same PMON design for DPU. | -To properly monitor the HA related features, we will need to add telemetry for monitoring it. +#### 2.3.4. DPU_STATE_DB (per-DPU) -The telemetry will cover both state and counters, which can be mapped into `DASH_STATE_DB` or `DASH_COUNTER_DB`. +##### 2.3.4.1. HA set state -* For ENI level states and counters in NPU DB, we will have `VDPU_ID` in the key as well as the `ENI_ID` to make each counter unique, because ENI migration from one DPU to another on the same switch. -* For ENI level states and counters in DPU DB, we don’t need to have `VDPU_ID` in the key, because they are tied to a specific DPU, and we should know which DPU it is during logging. +* The HA set state table contains the state of each HA set. -We will focus on only the HA counters below, which will not include basic counters, such as ENI creation/removal or generic DPU health/critical event counters, even though some of them works closely with HA workflows. +| Table | Key | Field | Description | +| --- | --- | --- | --- | +| DASH_HA_SET_STATE | | | State of each HA set. | +| | \ | | HA set ID | +| | | last_updated_time | The last update time of this state in milliseconds. | +| | | dp_channel_is_alive | Data plane channel is alive or not. | -### 2.1. HA state +##### 2.3.4.2. HA scope state -First of all, we need to store the HA states for us to check. +* The HA scope state table contains the HA state of each HA scope. + * When ENI-level HA is used, it shows the HA state of each ENI that is hosted on the local DPU. 
+ * When DPU-level HA is used, it shows the HA state of the local DPU. -Please refer to the [ENI state](#1242-eni-state) table in NPU DB for detailed DB schema design. +| Table | Key | Field | Description | +| --- | --- | --- | --- | +| DASH_HA_SCOPE_STATE | | | State of each HA scope. | +| | \ | | HA scope ID. It can be the HA set ID or ENI ID, depending on the which HA mode is used. | +| | | last_updated_time | The last update time of this state in milliseconds. | +| | | ha_role | The current HA role confirmed by ASIC. Please refer to the HA states defined in HA HLD. | +| | | ha_role_start_time | The time when HA role is moved into current one in milliseconds. | +| | | ha_term | The current term confirmed by ASIC. | +| | | activate_role_pending | DPU is pending on role activation. | +| | | flow_reconcile_pending | Flow reconcile is requested and pending approval. | +| | | brainsplit_recover_pending | Brainsplit is detected, and DPU is pending on recovery. | + +##### 2.3.4.3. Flow sync session states -### 2.2. HA operation counters +| Table | Key | Field | Description | +| --- | --- | --- | --- | +| DASH_FLOW_SYNC_SESSION_STATE | | | | +| | \ | | Flow sync session id. | +| | | state | Flow sync session state. It can be "created", "inprogress", "completed", "failed". | +| | | creation_time_in_ms | Flow sync session creation time in milliseconds. | +| | | last_state_start_time_in_ms | Flow sync session last state start time in milliseconds. | -Besides the HA states, we also need to log all the operations that is related to HA. +##### 2.3.4.4. DASH BFD probe state -HA operations are mostly lies in 2 places: `hamgrd` for operations coming from northbound interfaces and syncd for SAI APIs we call or SAI notification we handle related to HA. +The schema of `DASH_BFD_PROBE_STATE` table is defined in the [SmartSwitch BFD HLD](https://github.com/sonic-net/SONiC/pull/1635). Please refer to it for detailed definition. -#### 2.2.1. hamgrd HA operation counters +## 3. 
Telemetry -All the HA operation counters will be: +To properly monitor the HA related features, we provides the following telemetry for monitoring it. -* Saved in NPU side `COUNTER_DB`, since the `hamgrd` is running on NPU side. -* Partitioned with ENI level key: `DASH_HA_OP_STATS||`. +The telemetry will cover both state and counters, which can be mapped into `DPU_STATE_DB` or `DPU_COUNTERS_DB`. -| Name | Description | -| --- | --- | -| **state_enter*(req/success/failure)_count | Number of state transitions we have done (Request/Succeeded Request/Failed request). | -| total_(successful/failed)_*_state_enter_time_in_us | The total time we used to transit to specific state in microseconds. Successful and failed transitions need to be tracked separately, as they will have different patterns. | -| switchover_(req/success/failure)_count | Similar as above, but for switchover operations. | -| total_(successful/failed)_switchover_time_in_us | Similar as above, but for switchover operations. | -| shutdown_standby_(req/success/failure)_count | Similar as above, but for shutdown standby operations. | -| total_(successful/failed)_shutdown_standby_time_in_us | Similar as above, but for shutdown standby operations. | -| shutdown_self_(req/success/failure)_count | Similar as above, but for force shutdown operations. | -| total_(successful/failed)_shutdown_self_time_in_us | Similar as above, but for force shutdown operations. | +We will focus on only the HA counters below, which will not include basic counters, such as ENI creation/removal or generic DPU health/critical event counters, even though some of them works closely with HA workflows. -#### 2.2.2. HA SAI API counters +### 3.1. HA state and related health signals -All the HA SAI API counters will be: +To help simplify checking the HA states, `hamgrd` will aggregate the HA states and related health signals and store them in the HA scope state table. 
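To illustrate how a monitoring client might consume this aggregated state, here is a minimal Python sketch. The `summarize_ha_scope_state` helper, its verdict strings, and the sample entry are hypothetical; only the field names and their values ("up"/"down"/"unknown", and the HA roles) come from the HA scope state table in this document.

```python
# Hypothetical helper (not part of the HLD): derive a single health verdict
# from one DASH_HA_SCOPE_STATE entry, as fetched from the NPU STATE_DB.

def summarize_ha_scope_state(state: dict) -> str:
    """Return "healthy", "degraded" or "down" for one HA scope entry."""
    # The three aggregated health signals defined in the HA scope state table.
    planes = (
        state.get("local_vdpu_midplane_state", "unknown"),
        state.get("local_vdpu_control_plane_state", "unknown"),
        state.get("local_vdpu_data_plane_state", "unknown"),
    )
    if all(p == "up" for p in planes):
        # All signals up: trust the HA state reported by hamgrd.
        ok_roles = ("active", "standby", "standalone")
        return "healthy" if state.get("local_ha_state") in ok_roles else "degraded"
    if any(p == "down" for p in planes):
        return "down"
    # Some signals are still "unknown".
    return "degraded"

# Example: an active scope with all health signals up (made-up values).
sample = {
    "local_ha_state": "active",
    "local_vdpu_midplane_state": "up",
    "local_vdpu_control_plane_state": "up",
    "local_vdpu_data_plane_state": "up",
}
print(summarize_ha_scope_state(sample))  # healthy
```

How a deployment maps these verdicts to alerts is left to the monitoring system; the point is only that the per-plane states and the HA role are sufficient to derive one signal per HA scope.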
-* Saved in DPU side `DASH_COUNTER_DB`, as SAI APIs are called in DPU side syncd. -* Partitioned with ENI level key: `DASH_SAI_CALL_STATS|`. +Please refer to the [HA scope state](#2211-ha-scope-state) table in NPU DB for detailed DB schema design. -| Name | Description | -| --- | --- | -| *_(req/success/failure)_count | Number of SAI APIs we call or notifications we handle, with success and failure counters too. | -| total_*_(successful/failed)_time_in_us | Total time we used to do the SAI operations in microseconds. Successful and failed operations should be tracked separately, as they will have different patterns. | +### 3.2. Traffic forwarding related -### 2.3. HA control plane communication channel related +The following states and counters will help us monitor the NPU-to-DPU tunnel for traffic forwarding. -#### 2.3.1. HA control plane control channel counters +#### 3.2.1. NPU-to-DPU probe status (Per-HA Scope) -HA control plane control channel is running on NPU side, mainly used for passing the HA control commands. +This is managed by `swss` via overlay ECMP with BFD. Please refer to its HLD [here](https://github.com/r12f/SONiC/blob/user/r12f/ha/doc/vxlan/Overlay%20ECMP%20with%20BFD.md). -The counters of this channel will be: +#### 3.2.2. NPU-to-DPU tunnel counters (Per-HA Scope) -* Collected by `hamgrd` on NPU side. -* Saved in NPU side `DASH_COUNTER_DB`. -* Stored with key: `DASH_HA_CP_CONTROL_CHANNEL_STATS|`. - * This counter doesn’t need to be partitioned on a single switch, because it is shared for all ENIs. +On NPU side, we should have following counters to monitor how the NPU-to-DPU tunnel is working: | Name | Description | | --- | --- | -| is_alive | Is the channel alive for use. 0 = dead, 1 = alive. | -| channel_connect_count | Number of connect calls for establishing the data channel. | -| channel_connect_succeeded_count | Number of connect calls that succeeded. 
|
-| channel_connect_failed_count | Number of connect calls that failed because of any reason other than timeout / unreachable. |
-| channel_connect_timeout_count | Number of connect calls that failed due to timeout / unreachable. |
+| packets_forwarded | Number of packets hitting the traffic forwarding rule. |
+| bytes_forwarded | Number of bytes hitting the traffic forwarding rule. |

-#### 2.3.2. HA control plane data channel counters
+> TODO: Can we have counters for discards, errors and oversize?

-HA control plane data channel is composed with 2 parts: SAI flow API calls and `swbusd` for flow forwarding. The first one is already covered by all the SAI API counters above, so we will only focus on the `swbusd` part here. And the counters will be:
+### 3.3. DPU traffic handling related

-* Collected on `swbusd` on NPU side.
-* Saved in NPU side `DASH_COUNTER_DB`.
+On the DPU side, we should have the following counters to monitor how the DPU is handling the traffic.

-##### 2.3.2.1. Per bulk sync flow receive server counters
+#### 3.3.1. DPU level counters (Per-DPU)

-Since the data channel is formed by multiple flow receive servers, the data plane counters needs to be logged per each server: `DASH_HA_CP_DATA_CHANNEL_CONN_STATS|`.
+Although the majority of the counters will be tracked at the ENI level, a packet might fail before it lands on an ENI, for reasons such as an incorrect packet format. These counters will be tracked separately and stored on a per-port basis in the port counters.
+
+Here are some examples of port counters we have:

| Name | Description |
| --- | --- |
-| is_alive | Is the channel alive for use. 0 = dead, 1 = alive. |
-| channel_connect_count | Number of connect calls for establishing the data channel. |
-| channel_connect_succeeded_count | Number of connect calls that succeeded. |
-| channel_connect_failed_count | Number of connect calls that failed because of any reason other than timeout / unreachable. 
|
-| channel_connect_timeout_count | Number of connect calls that failed due to timeout / unreachable. |
-| bulk_sync_message_sent/received | Number of messages we send or receive for bulk sync via data channel. |
-| bulk_sync_message_size_sent/received | The total size of messages we send or receive for bulk sync via data channel. |
-| bulk_sync_flow_received_from_local | Number of flows received from local DPU |
-| bulk_sync_flow_forwarded_to_peer | Number of flows forwarded to paired DPU |
+| SAI_PORT_STAT_IF_IN/OUT_UCAST_PKTS | Number of packets received / sent. |
+| SAI_PORT_STAT_IF_IN/OUT_OCTETS | Total bytes received / sent. |
+| SAI_PORT_STAT_IF_IN/OUT_DISCARDS | Number of incoming/outgoing packets that get discarded. |
+| SAI_PORT_STAT_IF_IN/OUT_ERRORS | Number of incoming/outgoing packets that have errors like CRC errors. |
+| SAI_PORT_STAT_ETHER_RX/TX_OVERSIZE_PKTS | Number of incoming/outgoing packets that exceed the MTU. |

-##### 2.3.2.2. Per ENI counters
+For the full list of port counters, please refer to the `sai_port_stat_t` structure in [SAI header file](https://github.com/opencomputeproject/SAI/blob/master/inc/saiport.h).

-Besides per flow receive server, the counters should also be tracked on ENI level, so we can have a more aggregated view for each ENI. The key can be: `DASH_HA_CP_DATA_CHANNEL_ENI_STATS|`.
+Besides the standard SAI stats provided by each port, we also have the following counters extended in DASH:

| Name | Description |
| --- | --- |
-| bulk_sync_message_sent/received | Number of messages we send or receive for bulk sync via data channel. |
-| bulk_sync_message_size_sent/received | The total size of messages we send or receive for bulk sync via data channel. 
|
+| SAI_PORT_STAT_ENI_MISS_DROP_PKTS | Number of packets that are dropped due to ENI not found. |
+| SAI_PORT_STAT_VIP_MISS_DROP_PKTS | Number of packets that are dropped due to VIP not found. |
-| bulk_sync_flow_received_from_local | Number of flows received from local DPU |
-| bulk_sync_flow_forwarded_to_peer | Number of flows forwarded to paired DPU |
-
-> NOTE: We didn't add the ENI key in the per flow receive server counters, because multiple ENIs can share the same flow receive server. It is up to each vendor's implementation.

#### 3.3.2. ENI-level traffic counters (Per-ENI)

-### 2.4. NPU-to-DPU tunnel related (NPU side)
+Once the packet lands on an ENI, we should have the following counters to monitor how much traffic is handled by each ENI:

-The second part of the HA is the NPU-to-DPU tunnel. This includes the probe status and traffic information on the tunnel.
| Name | Description |
| -------------- | ----------- |
-#### 2.4.1. NPU-to-DPU probe status
| SAI_ENI_STAT_(/OUTBOUND_/INBOUND_)RX_BYTES | Total bytes received on ENI (overall/outbound/inbound) pipeline. |
| SAI_ENI_STAT_(/OUTBOUND_/INBOUND_)RX_PACKETS | Total number of packets received on ENI (overall/outbound/inbound) pipeline. |
| SAI_ENI_STAT_(/OUTBOUND_/INBOUND_)TX_BYTES | Total bytes sent by ENI (overall/outbound/inbound) pipeline. |
| SAI_ENI_STAT_(/OUTBOUND_/INBOUND_)TX_PACKETS | Total number of packets sent by ENI (overall/outbound/inbound) pipeline. |
-Latest probe status is critical for checking how each card and ENI performs, and where the packets should be forwarded to.

#### 3.3.3. ENI-level pipeline drop counters (Per-ENI)

-Please refer to the [DPU/vDPU HA state](#1241-dpu--vdpu-ha-states) tables in NPU DB for detailed DB schema design.
+When a packet lands on the ENI and goes through each match stage, it might be dropped because no entry can be matched, such as a routing or CA-PA mapping entry. In order to show these drops, we should have the following counters:

| Name | Description |
| -------------- | ----------- |
| SAI_ENI_STAT_OUTBOUND_ROUTING_ENTRY_MISS_DROP_PKTS | Number of packets that are dropped due to outbound routing entry not found. |
| SAI_ENI_STAT_OUTBOUND_CA_PA_ENTRY_MISS_DROP_PKTS | Number of packets that are dropped due to outbound CA-PA mapping entry not found. |
| SAI_ENI_STAT_TUNNEL_MISS_DROP_PKTS | Number of packets that are dropped due to tunnel not found. |
| SAI_ENI_STAT_INBOUND_ROUTING_MISS_DROP_PKTS | Number of packets that are dropped due to inbound routing entry not found. |
-#### 2.4.2. 
NPU-to-DPU data plane state

#### 3.3.4. ENI-level flow operation counters (Per-ENI)

-Depending on the probe status and HA state, we will update the next hop for each ENI to forward the traffic. This also needs to be tracked.
| Name | Description |
| -------------- | ----------- |
-Please refer to the [DASH_ENI_DP_STATE_TABLE](#1242-eni-state) table in NPU DB for detailed DB schema design.
| SAI_ENI_STAT_FLOW_CREATED | Total flow created on ENI. |
| SAI_ENI_STAT_FLOW_CREATE_FAILED | Total flow failed to create on ENI. |
| SAI_ENI_STAT_FLOW_UPDATED | Total flow updated on ENI. |
| SAI_ENI_STAT_FLOW_UPDATE_FAILED | Total flow failed to update on ENI. |
| SAI_ENI_STAT_FLOW_DELETED | Total flow deleted on ENI. |
| SAI_ENI_STAT_FLOW_DELETE_FAILED | Total flow failed to delete on ENI. |
| SAI_ENI_STAT_FLOW_AGED | Total flow aged out on ENI. A flow being aged out doesn't mean the flow entry is deleted. It could be marked as pending deletion and get deleted later. |

-#### 2.4.3. NPU-to-DPU tunnel counters

### 3.4. Flow sync counters

-On NPU side, we should also have ENI level tunnel traffic counters:
+Flow HA has 2 ways to sync the flows from the active to the standby side: inline sync and bulk sync. Below are the counters that are added for monitoring them.

-* Collected on the NPU side via SAI.
-* Saved in the NPU side `COUNTER_DB`.
-* Partitioned into ENI level with key: `DASH_HA_NPU_TO_ENI_TUNNEL_STATS|`. 
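To illustrate how the flow sync counters in this section can be consumed, here is a small Python sketch that derives the data plane probe failure rate in the way this section describes it (failed probes divided by transmitted probe packets). The helper and the counter snapshot are hypothetical, and treating "tx packets" as the sum of REQ and ACK transmissions is an assumption; only the counter names come from the per-HA-set tables in this section.

```python
# Hypothetical example (not part of the HLD): compute the data plane probe
# failure rate from a snapshot of the per-HA-set counters. The snapshot dict
# is made up; only the SAI_HA_SET_STAT_* names come from this document.

def dp_probe_failure_rate(counters: dict) -> float:
    """Failure rate = failed probes / probe packets sent (REQ + ACK, assumed)."""
    tx = (counters.get("SAI_HA_SET_STAT_DP_PROBE_REQ_TX_PACKETS", 0)
          + counters.get("SAI_HA_SET_STAT_DP_PROBE_ACK_TX_PACKETS", 0))
    if tx == 0:
        # No probes sent yet; report no failures rather than divide by zero.
        return 0.0
    return counters.get("SAI_HA_SET_STAT_DP_PROBE_FAILED", 0) / tx

# Made-up snapshot of one HA set's counters.
snapshot = {
    "SAI_HA_SET_STAT_DP_PROBE_REQ_TX_PACKETS": 900,
    "SAI_HA_SET_STAT_DP_PROBE_ACK_TX_PACKETS": 100,
    "SAI_HA_SET_STAT_DP_PROBE_FAILED": 10,
}
print(dp_probe_failure_rate(snapshot))  # 0.01
```

In practice a monitoring pipeline would compute this over a sliding window of counter deltas rather than absolute values, so that an old burst of failures does not mask a currently healthy channel.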
+#### 3.4.1. Data plane channel probing (Per-HA Set) | Name | Description | -| --- | --- | -| packets_in/out | Number of packets received / sent. | -| bytes_in/out | Total bytes received / sent. | -| packets_discards_in/out | Number of incoming/outgoing packets get discarded. | -| packets_error_in/out | Number of incoming/outgoing packets have errors like CRC error. | -| packets_oversize_in/out | Number of incoming/outgoing packets exceeds the MTU. | +| -------------- | ----------- | +| SAI_HA_SET_STAT_DP_PROBE_(REQ/ACK)_RX_BYTES | The bytes of data plane probes that this HA set received. | +| SAI_HA_SET_STAT_DP_PROBE_(REQ/ACK)_RX_PACKETS | The number of packets of data plane probes that this HA set received. | +| SAI_HA_SET_STAT_DP_PROBE_(REQ/ACK)_TX_BYTES | The bytes of data plane probes that this HA set sent. | +| SAI_HA_SET_STAT_DP_PROBE_(REQ/ACK)_TX_PACKETS | The number of packets of data plane probes that this HA set sent. | +| SAI_HA_SET_STAT_DP_PROBE_FAILED | The number of probes that failed. The failure rate = the number of failed probes / the number of tx packets. | -> NOTE: In implementation, these counters might have a more SAI-friendly name. +#### 3.4.2. Inline flow sync (Per-ENI) -### 2.5. NPU-to-DPU tunnel related (DPU side) +| Name | Description | +| -------------- | ----------- | +| SAI_ENI_STAT_(INLINE/TIMED)_FLOW_SYNC_PACKET_RX_BYTES | The bytes of inline/timed flow sync packet received by the ENI. | +| SAI_ENI_STAT_(INLINE/TIMED)_FLOW_SYNC_PACKET_RX_PACKETS | The number of inline/timed flow sync packets received by the ENI. | +| SAI_ENI_STAT_(INLINE/TIMED)_FLOW_SYNC_PACKET_TX_BYTES | The bytes of inline/timed flow sync packet that this ENI sent. | +| SAI_ENI_STAT_(INLINE/TIMED)_FLOW_SYNC_PACKET_TX_PACKETS | The number of inline/timed flow sync packets that this ENI sent. 
|
-On DPU side, every NPU-to-DPU tunnel traffic needs to be tracked on ENI level as well:
+Besides the packet level counters, we should also have flow operation level counters:

-* Collected on the DPU side via SAI.
-* Saved in DPU side `COUNTER_DB`.
-* Partitioned into ENI level with key: `DASH_HA_NPU_TO_ENI_TUNNEL_STATS|`.
+* The number of flow operations can be different from the number of flow sync packets. Depending on implementation, a single flow sync packet can carry multiple flow operations.
+* The flow operation could fail or be ignored by the ENI.
+ * Failed means it is unexpected to receive the packet, and we failed to process it.
+ * Ignored means the packet is expected to be received, but we should ignore the flow operation inside and move on without dropping the packet, e.g., more packets arrive before the flow sync is ack'ed.

| Name | Description |
| -------------- | ----------- |
-| packets_in/out | Number of packets received / sent. |
-| bytes_in/out | Total bytes received / sent. |
-| packets_discards_in/out | Number of incoming/outgoing packets get discarded. |
-| packets_error_in/out | Number of incoming/outgoing packets have errors like CRC error. |
-| packets_oversize_in/out | Number of incoming/outgoing packets exceeds the MTU. |
+| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_REQ_SENT | The number of inline/timed flow create/update/delete requests that the ENI sent. |
+| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_REQ_RECV | The number of inline/timed flow create/update/delete requests that the ENI received. |
+| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_REQ_FAILED | The number of inline/timed flow create/update/delete requests that the ENI received but failed to process. |
+| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_REQ_IGNORED | The number of inline/timed flow create/update/delete requests that the ENI received but whose flow operation was processed as ignored. 
| +| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_ACK_RECV | The number of inline/timed flow create/update/delete acks that the ENI received. | +| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_ACK_FAILED | The number of inline/timed flow create/update/delete acks that the ENI received but failed to process. | +| SAI_ENI_STAT_(INLINE/TIMED)\_FLOW\_(CREATE/UPDATE/DELETE)_ACK_IGNORED | The number of inline/timed flow create/update/delete acks that the ENI received but whose flow operation was processed as ignored. | -> NOTE: In implementation, these counters might have a more SAI-friendly name. +#### 3.4.3. Bulk sync related counters (Per-HA Set) -### 2.6. DPU-to-DPU data plane channel related +The control plane data channel will be used for bulk sync. It is implemented using a gRPC server on each DPU. -The next part is the DPU-to-DPU data plane channel, which is used for inline flow replications. - -* Collected on the DPU side via SAI. -* Saved in DPU side `COUNTER_DB`. -* Partitioned into ENI level with key: `DASH_HA_DPU_DATA_PLANE_STATS|`. +To monitor how it works, we should have the following counters on HA set level: | Name | Description | | --- | --- | -| inline_sync_packet_in/out | Number of inline sync packet received / sent. | -| inline_sync_ack_packet_in/out | Number of inline sync ack packet received / sent. | -| meta_sync_packet_in/out | Number of metadata sync packet (generated by DPU) received / sent. This is for flow sync packets of flow aging, etc. | -| meta_sync_ack_packet_in/out | Number of metadata sync ack packet received / sent. This is for flow sync packets of flow aging, etc. | -| probe_packet_in/out | Number of probe packet received from or sent to the paired ENI on the other DPU. This data is for DPU-to-DPU data plane liveness probe. | -| probe_packet_ack_in/out | Number of probe ack packet received from or sent to the paired ENI on the other DPU. This data is for DPU-to-DPU data plane liveness probe. 
| - -> NOTE: In implementation, these counters might have a more SAI-friendly name. +| SAI_HA_SET_STAT_CP_DATA_CHANNEL_CONNECT_ATTEMPTED | Number of connect calls for establishing the data channel. | +| SAI_HA_SET_STAT_CP_DATA_CHANNEL_CONNECT_RECEIVED | Number of connect calls received to establish the data channel. | +| SAI_HA_SET_STAT_CP_DATA_CHANNEL_CONNECT_SUCCEEDED | Number of connect calls that succeeded. | +| SAI_HA_SET_STAT_CP_DATA_CHANNEL_CONNECT_FAILED | Number of connect calls that failed for any reason other than timeout / unreachable. | +| SAI_HA_SET_STAT_CP_DATA_CHANNEL_CONNECT_REJECTED | Number of connect calls that were rejected, e.g. due to certificates. | +| SAI_HA_SET_STAT_CP_DATA_CHANNEL_TIMEOUT_COUNT | Number of connect calls that failed due to timeout / unreachable. | -### 2.7. DPU ENI pipeline related - -The last part is how the DPU ENI pipeline works in terms of HA, which includes flow operations: - -* Collected on the DPU side via SAI. -* Saved in DPU side `COUNTER_DB`. -* Partitioned into ENI level with key: `DASH_HA_DPU_PIPELINE_STATS|`. +Besides the channel status, we should also have the following counters for the bulk sync messages: | Name | Description | | --- | --- | -| flow_(creation/update/deletion)_count | Number of inline flow creation/update/delete request that failed for any reason. E.g. not enough memory, update non-existing flow, delete non-existing flow. | -| inline_flow_(creation/update/deletion)_req_sent | Number of inline flow creation/update/deletion request that sent from active node. Flow resimulation will be covered in flow update requests. | -| inline_flow_(creation/update/deletion)_req_received | Number of inline flow creation update/deletion request that received on standby node. | -| inline_flow_(creation/update/deletion)_req_succeeded | Number of inline flow creation update/deletion request that succeeded (ack received). 
| -| flow_creation_conflict_count | Number of inline replicated flow that is conflicting with existing flows (flow already exists and action is different). | -| flow_aging_req_sent | Number of flows that aged out in active and being replicated to standby. | -| flow_aging_req_received | Number of flow aging requests received from active side. Request can be batched, but in this counter 1 request = 1 flow. | -| flow_aging_req_succeeded | Number of flow aging requests that succeeded (ack received). | - -Please note that we will also have counters for how many flows are created/updated/deleted (succeeded or failed), aged out or resimulated, but this is not in the scope of HA, hence omitted here. - -> NOTE: In implementation, these counters might have a more SAI-friendly name. +| SAI_HA_SET_STAT_BULK_SYNC_MESSAGE_RECEIVED | Number of messages we received for bulk sync via data channel. | +| SAI_HA_SET_STAT_BULK_SYNC_MESSAGE_SENT | Number of messages we sent for bulk sync via data channel. | +| SAI_HA_SET_STAT_BULK_SYNC_MESSAGE_SEND_FAILED | Number of messages we failed to send for bulk sync via data channel. | +| SAI_HA_SET_STAT_BULK_SYNC_FLOW_RECEIVED | Number of flows received from bulk sync messages. A single bulk sync message can contain many flow records. | +| SAI_HA_SET_STAT_BULK_SYNC_FLOW_SENT | Number of flows sent via bulk sync messages. A single bulk sync message can contain many flow records. | -## 3. SAI APIs +## 4. SAI APIs Please refer to HA session API and flow API HLD in DASH repo for SAI API designs. -## 4. CLI commands +## 5. CLI commands The following commands shall be added in CLI for checking the HA config and states: -* `show dash ha config`: Shows HA global configuration. -* `show dash eni ha config`: Show the ENI level HA configuration. -* `show dash eni ha status`: Show the ENI level HA status. -* `show dash eni ha dp-status`: Show the ENI level data path status. +* `show dash ha global-config`: Show the HA global configuration. 
+* `show dash ha set status`: Show the HA set status. +* `show dash ha set counters`: Show the HA set counters. +* `show dash ha scope status [ha-scope-id]`: Show the HA scope status. +* `show dash ha scope counters [ha-scope-id]`: Show the HA scope counters. +* `show dash ha flow-sync-session counters [flow-sync-session-id]`: Show the HA flow sync session counters. diff --git a/doc/smart-switch/high-availability/smart-switch-ha-dpu-scope-dpu-driven-setup.md b/doc/smart-switch/high-availability/smart-switch-ha-dpu-scope-dpu-driven-setup.md new file mode 100644 index 0000000000..1990cf11ee --- /dev/null +++ b/doc/smart-switch/high-availability/smart-switch-ha-dpu-scope-dpu-driven-setup.md @@ -0,0 +1,388 @@ +# SmartSwitch High Availability High Level Design - DPU-Scope-DPU-Driven setup + +| Rev | Date | Author | Change Description | +| --- | ---- | ------ | ------------------ | +| 0.1 | 02/26/2024 | Riff Jiang | Initial version | +| 0.2 | 05/08/2024 | Mukesh Moopath Velayudhan | Add HA states, activate role and split-brain handling for DPU-driven HA. | +| 0.3 | 06/10/2024 | Riff Jiang | Improve HA state, activate role and fix the BFD responder workflow for DPU-driven HA. | + +1. [1. Terminology](#1-terminology) +2. [2. Background](#2-background) +3. [3. DPU-Scope-DPU-Driven setup Overview](#3-dpu-scope-dpu-driven-setup-overview) +4. [4. Network Physical Topology](#4-network-physical-topology) + 1. [4.1. DPU-level HA scope](#41-dpu-level-ha-scope) + 2. [4.2. DPU-level NPU to DPU traffic forwarding](#42-dpu-level-npu-to-dpu-traffic-forwarding) +5. [5. ENI programming with HA setup](#5-eni-programming-with-ha-setup) +6. [6. DPU liveness detection](#6-dpu-liveness-detection) + 1. [6.1. Card level NPU-to-DPU liveness probe](#61-card-level-npu-to-dpu-liveness-probe) + 2. [6.2. DPU-to-DPU liveness probe](#62-dpu-to-dpu-liveness-probe) +7. [7. HA state machine management](#7-ha-state-machine-management) + 1. [7.1. HA states](#71-ha-states) + 2. [7.2. 
HA role activation](#72-ha-role-activation) +8. [8. Planned operations](#8-planned-operations) + 1. [8.1. HA set creation](#81-ha-set-creation) + 2. [8.2. Planned shutdown](#82-planned-shutdown) + 1. [8.2.1. Shutdown standby DPU](#821-shutdown-standby-dpu) + 2. [8.2.2. Shutdown active DPU](#822-shutdown-active-dpu) + 3. [8.3. Planned switchover](#83-planned-switchover) + 4. [8.4. ENI migration / HA re-pair](#84-eni-migration--ha-re-pair) +9. [9. Unplanned operations](#9-unplanned-operations) + 1. [9.1. Unplanned failover](#91-unplanned-failover) +10. [10. Split-brain and re-pair](#10-split-brain-and-re-pair) +11. [11. Flow tracking and replication](#11-flow-tracking-and-replication) +12. [12. Detailed design](#12-detailed-design) +13. [13. Test Plan](#13-test-plan) + +## 1. Terminology + +| Term | Explanation | +| ---- | ----------- | +| HA | High Availability. | +| NPU | Network Processing Unit. | +| DPU | Data Processing Unit. | +| ENI | Elastic Network Interface. | +| SDN | Software Defined Network. | +| VIP | Virtual IP address. | + +## 2. Background + +This document adds and describes the high level design of the `DPU-Scope-DPU-Driven` setup in SmartSwitch High Availability (HA), as an extension to our main SmartSwitch HA design document [here](../smart-switch-ha.md), which describes only how the ENI-level HA works (or how the `NPU-Driven-ENI-Scope` setup works). + +Many things in this setup share the same or a very similar approach as the ENI-level HA, hence this document will focus on the differences and new things that are specific to this setup. + +## 3. DPU-Scope-DPU-Driven setup Overview + +In SmartSwitch HA design, there are a few key characteristics that define the high level behavior of HA: + +- **HA pairing**: How are the ENIs placed among all DPUs to form the HA set? +- **HA owner**: Who drives the HA state machine on behalf of the SDN controller? +- **HA scope**: At which level is the HA state machine managed? 
This determines how the traffic is forwarded from NPU to DPU. +- **HA mode**: How do the DPUs coordinate with each other to achieve HA? + +From these characteristics, here are the main differences between the `DPU-Scope-DPU-Driven` setup and the `NPU-Driven-ENI-Scope` setup in the main HLD: + +| Characteristic | `DPU-Scope-DPU-Driven` setup | ENI-level HA setup | +| -------------- | ------------------------ | ------------ | +| HA pairing | Card-level pairing. | Card-level pairing. | +| HA scope | DPU-level HA scope. | ENI-level HA scope. | +| HA owner | DPU drives the HA state machine. | `hamgrd` in NPU drives HA state machine. | +| HA mode | Active-standby | Active-standby | + +## 4. Network Physical Topology + +The network physical topology for the `DPU-Scope-DPU-Driven` setup will be the same as the ENI-level HA setup, e.g. where the NPU/DPU is placed and wired. The main difference is that the HA scope is on the level of all ENIs or a group of ENIs rather than a single ENI. The current deployment will use only 1 HA scope per DPU, so all ENIs on the DPU are mapped to the same HA scope. + +This results in the following differences in the network physical topology, and the figure below captures the essence of the differences: + +![](./images/ha-scope-dpu-level.svg) + +### 4.1. DPU-level HA scope + +With DPU-level HA scope, the HA state is controlled at the DPU level. This means, in active-standby mode, all ENIs on a DPU will be set to active or standby together. + +### 4.2. DPU-level NPU to DPU traffic forwarding + +Because the HA scope is at the DPU level, the traffic forwarding from NPU to DPU will be done on DPU level: + +- Each DPU pair will use a dedicated VIP per HA scope for traffic forwarding from NPU to DPU. +- All VIPs for all DPUs in the SmartSwitch will be pre-programmed into the NPU and advertised to the network as a single VIP range (or subnet). 
+- NPU will set up a route to match the packet on the destination VIP, instead of ACLs to match both the VIP and inner MAC. + +The data path and packet format will be the same as the ENI-level HA setup [as described in the main HA design doc](./smart-switch-ha-hld.md#421-eni-level-npu-to-dpu-traffic-forwarding). + +## 5. ENI programming with HA setup + +The services and high level architecture that are used in the `DPU-Scope-DPU-Driven` setup, such as `hamgrd`, will be the same as in the ENI-level HA setup. + +The high level workflow also follows the main HA design doc, more specifically: + +- Before programming the ENIs, the SDN controller creates the DPU pair by programming the HA set into all the NPUs that receive the traffic. `hamgrd` will get notified and call `swss` to set up all the traffic forwarding rules on the NPU, as well as send the DPU pair information to the DPU. +- After the HA set is programmed, the SDN controller can create the ENIs and program all the SDN policies for each ENI independently. + +For more details on the contract and design, please refer to the detailed design section. + +## 6. DPU liveness detection + +In the `DPU-Scope-DPU-Driven` setup, `hamgrd` will not drive the HA state machine and the DPU will drive the HA at DPU level, hence we don't have the ENI-level traffic control, but the [Card level NPU-to-DPU liveness probe based on BFD](./smart-switch-ha-hld.md#61-card-level-npu-to-dpu-liveness-probe) and [DPU-to-DPU liveness probe](./smart-switch-ha-hld.md#63-dpu-to-dpu-liveness-probe) will still be used. + +### 6.1. Card level NPU-to-DPU liveness probe + +Same as the ENI-level HA setup, in the `DPU-Scope-DPU-Driven` setup, the BFD probe will: + +- Be set up on all NPUs for both IPv4 and IPv6 and probe both DPUs. +- Be responded to by both DPUs, no matter whether they are active or standby. +- Still only be used for controlling the traffic forwarding behavior from NPU to DPU, and won't be used as the health signal for triggering any failover inside the DPU. 
+ +Unlike the ENI-level HA setup, the BFD probe can work alone to make the traffic forwarding decision without ENI level info. To achieve this, the SDN controller will program the HA set to the NPU with the preferred DPU as the active/standby setting. + +Here is how the probe works in detail (with DPU0 as the preferred DPU): + +| DPU0 | DPU1 | Preferred DPU | Next hop | Comment | +| --- | --- | --- | --- | --- | +| Down | Down | DPU0 | DPU0 | Both down is essentially the same as both up, hence the effect is the same as Up+Up. | +| Down | Up | DPU0 | DPU1 | NPU will forward all traffic to DPU1, because DPU0 is not reachable. | +| Up | Down | DPU0 | DPU0 | NPU will forward all traffic to DPU0, since DPU0 is preferred and reachable. | +| Up | Up | DPU0 | DPU0 | If both DPUs are up, then we respect the preferred DPU. | + +For more details on NPU-to-DPU probes: + +- The data path and packet format of the BFD probe will be the same as the one defined in the main HA design doc. Please refer to the [Card level NPU-to-DPU liveness probe design](./smart-switch-ha-hld.md#61-card-level-npu-to-dpu-liveness-probe). +- For the detailed design of the BFD probe in SmartSwitch, please refer to the [SmartSwitch BFD detailed design doc](https://github.com/sonic-net/SONiC/pull/1635). + +### 6.2. DPU-to-DPU liveness probe + +In the `DPU-Scope-DPU-Driven` setup, the DPU-to-DPU liveness probe will still be used as the health signal for triggering DPU failover. + +However, unlike the ENI-level HA setup, upon DPU-to-DPU probe failure, the DPU will drive the HA state machine and fail over by itself, without the help of `hamgrd`. + +The data path and packet format of the DPU-to-DPU probe will be the same as the one defined in the main HA design doc. Please refer to the [DPU-to-DPU data plane channel design](./smart-switch-ha-hld.md#4352-dpu-to-dpu-data-plane-channel). + +## 7. 
HA state machine management + +In the `DPU-Scope-DPU-Driven` setup, the HA state machine will be managed by the DPU itself, and `hamgrd` will not drive the HA state machine, but only be used for generating the configurations for NPU and DPU, as well as collecting and reporting telemetry in the context of HA whenever needed. + +### 7.1. HA states + +Since the DPU will be driving the HA state machine transition, any HA state change will need to be reported, otherwise the SDN controller will not be able to know the current HA state of the DPU. + +Since different DPUs could drive the state machine differently, to normalize the states, we provide the following states for describing the behaviors. However, the transitions are not defined in this design doc; this allows minor differences between implementations. + + | State | Description | + | ----- | ------------------------------------ | + | Dead | Initial state. Not participating in HA | + | Connecting | Trying to connect to its HA pair | + | Connected | Connection successful and starting active / standby selection. | + | InitializingToActive | Bulk sync in progress. After bulk sync, this DPU will become active after activation. | + | InitializingToStandby | Bulk sync in progress. After bulk sync, this DPU will become standby after activation. | + | PendingActiveActivation | Bulk sync successful, waiting for role activation from SDN controller to go to Active. Dormant state since BFD disabled hence no traffic. | + | PendingStandbyActivation | Bulk sync successful, waiting for role activation from SDN controller to go to Standby. Dormant state since BFD disabled hence no traffic. | + | PendingStandaloneActivation | Could not connect to pair, waiting for role activation from SDN controller to go to Standalone. Dormant state since BFD disabled hence no traffic. | + | Standalone | Standalone role is activated. Responding BFD and forwarding traffic | + | Active | Active role is activated. 
Responding BFD, forwarding traffic and syncing flows to its pair | + | Standby | Standby role is activated. Responding BFD and receiving flow sync from its pair | + | Destroying | Going down for a planned shutdown | + | SwitchingToStandalone | Gracefully transitioning from paired state to standalone | + + > NOTE: Some DPUs might support traffic forwarding in the standby state for existing flows as well. However, the flow decision can still only be made on the active side, which matches the concepts in the HA design. Hence, this is an implementation detail and not a requirement of the HA design at this moment. + +### 7.2. HA role activation + +After the bulk sync is done and before the DPU moves to `Active` or `Standby` state and takes traffic, the DPU is required to go through the HA role activation process. + +The purpose of the HA role activation is to ensure the ENIs (and policies) on the card are up-to-date. Although bulk sync makes the flow table in sync on the 2 DPUs, it doesn't guarantee the ENIs are up-to-date. If any ENI or policy is missing, existing flows can be dropped (due to the ENI missing) or new flows can be established with the wrong policy (due to the policy missing). Role activation is the process that helps bridge this gap. + +In `DPU-driven` mode, the DPU shall first move into the `PendingActive/Standby/StandaloneActivation` state, then send a SAI notification on the HA scope as a role activation request. The request will go through `hamgrd` and eventually be delivered to the SDN controller. The SDN controller will ensure the policies become identical on both cards, then approve the request. Once the DPU receives the approval, it can then move to `Active`, `Standby` or `Standalone` state. + +## 8. Planned operations + +There are a few things we need to notice for the planned operations: + +- Same as the main HA HLD: + - Planned operations are always initiated by SDN controller. 
+ - All HA state changes and counters will be reported from the ASIC eventually to the SDN controller via gNMI interfaces, hence omitted in the following workflow. +- All operations will happen on DPU level, for example, switchover or removal from the HA set. +- Since the DPU will be driving the HA state machine, `hamgrd` will only be used for passing through the configurations and reporting the states and telemetry. + +Here is how the workflows look for the typical planned operations: + +### 8.1. HA set creation + +The HA bring-up workflow is described below: + +> Please note again that different implementations could have slight differences in the state machine transitions, so the workflow below only defines the hard requirements for entering and exiting each state during the workflow. + +1. The DPU starts out with its initial HA role as `Dead`. +2. SDN controller pushes all DASH configurations to the DPU, including the HA set and the HA scope: + 1. The SDN controller sets HA scope `Role` to `Active` or `Standby`, `AdminState` to `Disabled`. +3. Once all configurations have been pushed, SDN controller starts the HA state machine on the DPU by updating the HA scope `AdminState` to `Enabled`. +4. DPU moves to `Connecting` state and attempts to connect to its pair specified in the HA set. +5. If the DPU fails to connect to its peer: + 1. It needs to first move to `PendingStandaloneActivation` state and send a role activation request to the SDN controller. +6. If the DPU connects to its peer successfully: + 1. It needs to move to `Connected` state and decide if it needs to become `Active` or `Standby`, and start bulk sync if needed. + 2. Once bulk sync is done, the DPU can move to `PendingActive/StandbyActivation` state and send a role activation request to the SDN controller. +7. Once activation is approved, the DPU will receive the approval and move to `Active`/`Standby`/`Standalone` state after activation. +8. 
Once the state is moved to `Active`/`Standby`/`Standalone` state, `hamgrd` will create the BFD responder on the DPU side and start responding to BFD probes. +9. When each switch sees the BFD response, it will start forwarding traffic to the DPU whose `Role` is set to `Active`. + +```mermaid +sequenceDiagram + autonumber + + participant S0N as Switch 0 NPU + participant S0D as Switch 0 DPU
(Desired Active) + participant S1D as Switch 1 DPU
(Desired Standby) + participant S1N as Switch 1 NPU + participant SA as All Switches
(Includes Switch 0/1) + participant SDN as SDN Controller + + SDN->>SA: Create HA set and HA scope
on all switches with
desired Active/Standby roles. + SA->>SA: Start BFD probe to all DPUs
but DPUs do not respond yet.
+ SDN->>S0N: Push all DASH configuration objects to DPU0. This includes the HA set and HA scope with
desired Active/Standby roles but Admin state set to disabled. + SDN->>S1N: Push all DASH configuration objects to DPU1.
This includes the HA set and HA scope with
desired Active/Standby roles but
Admin state set to disabled. + S0N->>S0D: Create all DASH objects
including HA set and HA scope
with admin state to Disabled. + S1N->>S1D: Create all DASH objects
including HA set and HA scope
with admin state to Disabled. + + SDN->>S0N: Set DPU0 HA scope
admin state to Enabled + SDN->>S1N: Set DPU1 HA scope
admin state to Enabled + S0N->>S0D: Set HA scope
admin state to Enabled + S1N->>S1D: Set HA scope
admin state to Enabled + + S0D->>S1D: Connect to peer and
start pairing + S1D->>S0D: Connect to peer and
start pairing + Note over S0D,S1D: Bulk sync can happen at
this stage if needed. + + S0D->>S0N: Request role activation + S1D->>S1N: Request role activation + S0N->>SDN: Request role activation for DPU0 + S1N->>SDN: Request role activation for DPU1 + Note over S0D,S1D: The DPUs will wait for activation
from SDN controller before
moving to active/standby state. + Note over S0D,S1D: Inline sync channel can be
established at this stage. + + SDN->>S0N: Approve role activation for DPU0 + SDN->>S1N: Approve role activation for DPU1 + S0N->>S0D: Approve role activation + S1N->>S1D: Approve role activation + + S0D->>S0N: Enter active state + S0N->>S0D: Request to start responding to all BFD probes + S1D->>S1N: Enter standby state + S1N->>S1D: Request to start responding to all BFD probes + S0D->>S0D: Start responding to all BFD probes + S1D->>S1D: Start responding to all BFD probes + + SA->>SA: BFD to both DPUs will be up,
but only DPU0 will be set as next hop + Note over S0N,SA: From now on, traffic for all ENIs in this HA scope will be forwarded to DPU0. +``` + +### 8.2. Planned shutdown + +With `DPU-driven` mode, the shutdown request will be directly forwarded to the DPU. The `hamgrd` instances will ***NOT*** coordinate with each other to make sure the shutdown is done in a safe way. + +Let's say DPUs 0 and 1 are in steady state (Active/Standby) in the HA set and we are trying to shut down DPU 0. The workflow of planned shutdown will be as follows: + +1. SDN controller programs the DPU 0 HA scope desired HA state to `dead`. +2. `hamgrd` and DPU side `swss` will: + 1. Shut down the BFD responder on the DPU side, which moves all traffic to DPU 1. + 2. Update the DPU 0 HA role to `Dead`. +3. DPU 0 moves itself to `Destroying` state. + 1. At this stage, traffic can initially land on both DPUs until it fully shifts to DPU 1. + 2. DPU 0 needs to gracefully shut down and help DPU 1 move to `Standalone` state while continuing to sync flows. + 3. DPU 0 will start a configurable timer to wait for the network to converge and traffic to fully switch over to DPU 1. + 4. Once this timer expires, no traffic should land on DPU 0. DPU 0 stops forwarding traffic and starts shutting down. +4. At the same time that DPU 0 is shutting down, DPU 1 will: + 1. Move to `SwitchingToStandalone` to drain all the in-flight messages between the DPUs and then finally `Standalone` state when DPU 0 is fully shut down. + 2. If DPU 1 was in `Standby` state, it needs to pause flow resimulation for all connections synced from the old Active and send a flow reconcile request to the SDN controller. + 3. Only after the flow reconcile is received from the SDN controller will the flow resimulation be resumed for these connections. This is to ensure the flow resimulation won't accidentally pick up old SDN policy. + +The above DPU state transitions remain the same irrespective of which DPU is shut down. 
The only difference is that if the standby DPU 1 were being shut down instead of DPU 0, the switch NPUs would continue to forward to DPU 0 without any need to shift traffic. + + +```mermaid +sequenceDiagram + autonumber + + participant S0N as Switch 0 NPU + participant S0D as Switch 0 DPU
(Active->Dead) + participant S1D as Switch 1 DPU
(Standby->Active) + participant S1N as Switch 1 NPU + participant SA as All Switches
(Includes Switch 0/1) + participant SDN as SDN Controller + + Note over SA: Initially, traffic for all ENIs in
this HA scope will be forwarded to DPU0. + + SDN->>S0N: Programs HA scope with dead desired state for DPU0. + S0N->>S0D: Request to stop responding to BFD. + + S0D->>S0D: Stop responding BFD. + SA->>SA: Set DPU1 as next hop for traffic forwarding. + Note over SA: Traffic for all ENIs in this HA scope
will start transitioning if required. + + S0N->>S0D: Update HA scope with dead state. + S0D->>S0N: Enter Destroying state + Note over S0D,S1D: DPU starts to drive internal
HA states to gracefully
remove the peer. + Note over S0D,S1D: DPU waits for
network convergence timer. + + S1D->>S1N: Enter SwitchingToStandalone state + Note over S0D,S1D: Inline sync channel
should be stopped. + + S0D->>S0N: Enter dead state + S1D->>S1N: Enter standalone state + Note over S0D,S1D: DPU1 ignores all flow
resimulation requests
for synced connections + + S1D->>S1N: Send flow reconcile needed notification + S1N->>SDN: Send flow reconcile needed notification + SDN->>S1N: Ensure the latest policy is programmed + SDN->>S1N: Request flow reconcile + S1N->>S1D: Request flow reconcile + Note over S0D,S1D: DPU1 resumes handling all flow
resimulation requests +``` + +### 8.3. Planned switchover + +In DPU-driven setup, switchover is done via shutdown one side of the DPU, and DPUs pair need to be able to handle the switchover internally. + +### 8.4. ENI migration / HA re-pair + +To support things like upgrade, we need to update the HA set to pair with another DPU. In this case, the following steps needs to be performed step by step: + +1. Trigger [Planned shutdown](#82-planned-shutdown) on the DPU that needs to be removed from the HA set. +2. Update HA set information on all switches and the corresponding DPUs. + - This will cause the tables and objects related to old HA set to be removed and new HA set to be created. + - The new DPU joining the HA pair will be in Dead state at this point. +3. Program all ENIs on the new DPU. +4. Once all configurations are done, SDN controller updates the HA admin state to Enabled to start the [HA set creation](#81-ha-set-creation) workflow. + +## 9. Unplanned operations + +### 9.1. Unplanned failover + +```mermaid +sequenceDiagram + autonumber + + participant S0N as Switch 0 NPU + participant S0D as Switch 0 DPU
(Active->Dead) + participant S1D as Switch 1 DPU
(Standby->Standalone) + participant S1N as Switch 1 NPU + participant SA as All Switches
(Includes Switch 0/1) + participant SDN as SDN Controller + + Note over S0N,SA: Initially, traffic for all ENIs in
this HA scope will be forwarded to DPU0. + destroy S0D + S0D-XS0D: DPU0 went dead + S0N->>S0N: PMON detects DPU failure
and update DPU state to dead. + + SA->>SA: BFD to DPU0 start to fail. + Note over SA: Traffic for all ENIs in this HA scope
starts to shift to DPU1. + + S1D->>S1D: DPU-to-DPU probe starts to fail. + Note over S1D: DPU starts to drive internal
HA states to failover. + + S1D->>S1N: Enter standalone state + Note over S1D: DPU1 starts to ignore all flow
resimulation requests
for synced connections + + S1D->>S1N: Send flow reconcile needed notification + S1N->>SDN: Send flow reconcile needed notification + SDN->>S1N: Ensure the latest policy is programmed + SDN->>S1N: Request flow reconcile + S1N->>S1D: Request flow reconcile +``` + +## 10. Split-brain and re-pair + +In DPU-driven HA mode, depending on how the DPU detects the health state of its pair, it could run into a split-brain problem, when the DPU-DPU communication channel breaks while both DPUs are still up. In this case, both DPUs could move into `Standalone` state and both start to make flow decisions. + +When this happens, the DPUs cannot recover on their own and shall refuse to re-pair automatically. It is the responsibility of the SDN controller to break the split-brain by restarting HA on one of the DPUs (setting HA scope disabled to `true` and then `false`). Unlike the desired HA state, which will drive the state machine gracefully, the disabled state will force the DPU to shut down. This operation is required, because the DPU HA state machine could get stuck and never recover gracefully. + +## 11. Flow tracking and replication + +The `DPU-Scope-DPU-Driven` setup will not change how the flow lifetime is managed and how the inline flow replication works, since they are currently managed by the DPU under the SAI APIs already. However, it will change how bulk sync works, as the DPU will directly do the bulk sync without going through the HA control plane sync channel. + +## 12. Detailed design + +Please refer to the [detailed design doc](./smart-switch-ha-detailed-design.md) for DB schema, telemetry, SAI API and CLI design. + +## 13. Test Plan + +Please refer to HA test docs for detailed test bed setup and test case design. 
diff --git a/doc/smart-switch/high-availability/smart-switch-ha-hld.md b/doc/smart-switch/high-availability/smart-switch-ha-hld.md index 3c0d204e4e..8887360163 100644 --- a/doc/smart-switch/high-availability/smart-switch-ha-hld.md +++ b/doc/smart-switch/high-availability/smart-switch-ha-hld.md @@ -8,6 +8,7 @@ | 0.4 | 08/17/2023 | Riff Jiang | Redesigned HA control plane data channel | | 0.5 | 10/14/2023 | Riff Jiang | Merged resource placement and topology section and moved detailed design out for better readability | | 0.6 | 10/22/2023 | Riff Jiang | Added ENI leak detection | +| 0.7 | 10/13/2024 | Riff Jiang | Update HA control plane components graph to match with latest design update on database and gNMI. | 1. [1. Background](#1-background) 2. [2. Terminology](#2-terminology) @@ -153,6 +154,8 @@ 2. [11.5.2.2. Flow tracking in steady state](#11522-flow-tracking-in-steady-state) 3. [11.5.2.3. Tracking phase](#11523-tracking-phase) 4. [11.5.2.4. Syncing phase](#11524-syncing-phase) + 3. [11.5.3. Multi-channel problem](#1153-multi-channel-problem) + 1. [11.5.3.1. Per-flow version number](#11531-per-flow-version-number) 6. [11.6. Flow re-simulation support](#116-flow-re-simulation-support) 12. [12. Debuggability](#12-debuggability) 1. [12.1. ENI leak detection](#121-eni-leak-detection) @@ -1462,7 +1465,7 @@ Once the HA pair starts to run as standalone setup, the inline sync will stop wo 1. New flows can be created on one side, but not the other. 2. Existing flows can be terminated on one side, but not the other. -3. Existing flows can be aged out on one side, but not the other, depending on how we manage the lifetime of the lows. +3. Existing flows can be aged out on one side, but not the other, depending on how we manage the lifetime of the flows. 4. Due to policy updates, the same flow might get different packet transformations now, e.g., flow resimulation or flow recreation after policy update. 
And during recovery, we need to merge these 2 sets of flows back to one using "[bulk sync](#115-bulk-sync)". @@ -1880,6 +1883,24 @@ Whenever any flow is created or updated (flow re-simulation), update the flow ve 4. Handle bulk sync done event from ASIC, which will be sent after all flow change events are notified. 5. Call bulk sync completed SAI API, so ASIC can delete all tracked flow deletion records. Also reset `ToSyncFlowVerMin` and `ToSyncFlowVerMax` to 0, because there is nothing to sync anymore. +#### 11.5.3. Multi-channel problem + +During bulk sync, there would be two sync channels now: inline sync and bulk sync. As the 2 channels work independently, if a flow uses both channels to sync states from active to standby, the sync messages received by standby may be out-of-order and thus cause problems. + +The following illustration demonstrates one problematic case: the inline sync first writes a newer state to standby data plane and then bulk sync writes an older state. Finally, the synchronized state in the standby is the older state, rather than the desired newer one. + +

+![Out of order in bulk sync](./images/ha-bulk-sync-multichannel-ooo.svg)
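The race above can be sketched with a toy model (Python, illustrative names only, not part of any real DPU implementation): without any ordering information, the standby applies whichever sync message arrives last, regardless of which channel carried it.

```python
# Toy model of the multi-channel race: the standby applies sync messages
# in arrival order, so a late bulk sync message can overwrite newer state.
# All names here are illustrative only.

standby_flow_table = {}

def on_sync_message(key, state):
    # No version info: last writer wins, regardless of the channel.
    standby_flow_table[key] = state

on_sync_message("flow-1", "new")  # inline sync delivers the new state first
on_sync_message("flow-1", "old")  # stale bulk sync message arrives later
assert standby_flow_table["flow-1"] == "old"  # standby keeps the stale state
```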

+ +##### 11.5.3.1. Per-flow version number + +A per-flow version number algorithm is proposed to solve this issue. + +The algorithm attaches a unique, per-flow version number to every state of a flow, so the standby can decide which state is newer based on the version number. + +The timing graph of the per-flow version number algorithm is illustrated below. When the standby receives the older state with version number X, it rejects it, because the locally stored version number of the flow is X + 1, which is greater than X, meaning the current state is newer. + +

+![Per-flow version number](./images/ha-bulk-sync-multichannel-per-flow-version.svg)
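To make the comparison concrete, here is a minimal sketch (Python, with hypothetical names; the real arbitration lives in the DPU data plane) of a standby that applies a synced flow state only when its version number is newer than the one it already holds:

```python
# Minimal sketch of per-flow version arbitration on the standby.
# All names are illustrative; not a real SAI/DPU API.

flow_table = {}  # flow key -> (version, state)

def apply_flow_sync(key, version, state):
    """Apply a synced flow state only if it is newer than what we hold."""
    current = flow_table.get(key)
    if current is not None and current[0] >= version:
        # Older or duplicate state (e.g. a late bulk sync message): reject.
        return False
    flow_table[key] = (version, state)
    return True

# Inline sync delivers the newer state (version X + 1 = 2) first...
apply_flow_sync("flow-1", version=2, state="new")
# ...then bulk sync delivers the stale state (version X = 1): rejected.
apply_flow_sync("flow-1", version=1, state="old")
assert flow_table["flow-1"] == (2, "new")
```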

+ ### 11.6. Flow re-simulation support When certain policies are updated, we will have to update the existing flows to ensure the latest policy takes effect. This is called "flow re-simulation".