
ConnectX: Review ConnectX-5 transmit/receive/forwarding performance on PCI-e Gen4 #1471

eugeneia opened this issue Feb 23, 2022 · 1 comment

@eugeneia (Member):
Over the past weeks I have compiled a new benchmark report for current Mellanox ConnectX-5 cards on an Intel PCI-e Gen3 system and an AMD EPYC PCI-e Gen4 system. This issue follows up on the discussions in #1007 and #1013.

The benchmarks were run using the code in #1469.

I think there’s a lot to unpack in the report and I can’t explain all of it, but I would like to highlight some new things we learned.

Let’s start with what I think is the most interesting plot:

[Plot: packet rate vs. packet size, Intel PCI-e Gen3 vs. AMD EPYC PCI-e Gen4]

  • It looks to me like this NIC generation is optimized for PCI-e Gen4. I can't say for certain, but I would assume that if the Intel server also had Gen4 it would achieve line rate at >200 B packets (see the back-of-envelope sketch after this list).
  • The relationship between packet rate and packet size is not continuous but modal! We might be seeing plateaus/steps that correspond to some buffer combination/size selection for a given packet size within the device.
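
For context, here is a back-of-envelope sketch (not part of the benchmark code) of why PCI-e bandwidth could plausibly cap small-packet rates on Gen3 but not on Gen4. The usable PCI-e bandwidth figures and the per-packet DMA overhead are rough assumptions for illustration, not measurements from our systems:

```python
# Back-of-envelope model: Ethernet line rate vs. a crude PCI-e bandwidth bound.
# All PCI-e figures are assumptions for illustration, not measurements.

ETH_OVERHEAD = 20   # preamble (8 B) + inter-frame gap (12 B) per packet
LINK_GBPS = 100     # 100GbE

# Rough usable PCI-e bandwidth per direction for an x16 slot, after encoding
# and protocol overhead (assumed; depends on MaxPayloadSize, TLP overhead, ...).
PCIE_GBPS = {"gen3 x16": 110, "gen4 x16": 220}
DMA_OVERHEAD = 64   # assumed per-packet bytes for descriptors/completions/TLP headers

def line_rate_mpps(pkt_size):
    """Theoretical 100GbE packet rate (Mpps) for a given frame size incl. FCS."""
    return LINK_GBPS * 1e9 / ((pkt_size + ETH_OVERHEAD) * 8) / 1e6

def pcie_bound_mpps(pkt_size, pcie_gbps):
    """Crude upper bound (Mpps) from PCI-e bandwidth alone."""
    return pcie_gbps * 1e9 / ((pkt_size + DMA_OVERHEAD) * 8) / 1e6

for size in (64, 128, 200, 256, 512, 1500):
    print(f"{size:>5} B: line rate {line_rate_mpps(size):6.2f} Mpps | "
          f"gen3 bound {pcie_bound_mpps(size, PCIE_GBPS['gen3 x16']):6.2f} Mpps | "
          f"gen4 bound {pcie_bound_mpps(size, PCIE_GBPS['gen4 x16']):6.2f} Mpps")
```

With these assumed numbers the Gen3 bound falls below Ethernet line rate for small packets while the Gen4 bound stays above it, which is the shape of the effect we think we are seeing. The model only captures raw bandwidth; it says nothing about per-transaction or descriptor-rate limits, which clearly also matter at 64 B.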

Next is a reproduction of the plot in #1007, comparing transmit rate for 64 B packets by number of transmit queues:

[Plot: 64 B transmit rate by number of transmit queues, Intel PCI-e Gen3 vs. AMD EPYC PCI-e Gen4]

  • For the Intel Gen3 system we see a plot that matches the previous observations but is overall lower. Again, I think this might be because it's a PCI-e Gen3 system and would be higher on an otherwise equivalent Gen4 system (the negotiated link speed can be checked as sketched after this list).
  • The EPYC system looks good with >=5 queues. However there’s a huge cliff between 4 and 5 tx queues that baffled us and that we can’t really explain. Some mode change in the northbridge system? No clue!
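
As a sanity check for the Gen3/Gen4 question, the negotiated link speed and width can be read from sysfs. A minimal sketch, assuming the PCI address from the lspci output further down (8.0 GT/s means Gen3, 16.0 GT/s means Gen4):

```python
# Minimal sketch: read the negotiated PCI-e link speed/width from sysfs.
# The device address below is the one from our lspci output; substitute your own.
from pathlib import Path

dev = Path("/sys/bus/pci/devices/0000:81:00.0")

for attr in ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width"):
    print(f"{attr}: {(dev / attr).read_text().strip()}")
```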

Overall we found that the EPYC system achieved the highest performance of our two test systems, but it was also more... complicated. Before we got any reasonable performance on the EPYC system we had to find and properly set a BIOS option:

Preferred IO bus = 81

Where 81 is the bus our ConnectX-5 card is installed in:

81:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
81:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
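
As an aside, for anyone reproducing this: the bus number to plug into the BIOS option can also be found programmatically. Here is an illustrative sketch that walks sysfs looking for the Mellanox vendor ID (0x15b3); it is not part of the benchmark code:

```python
# Illustrative sketch: list Mellanox devices (vendor ID 0x15b3) and print the
# PCI bus number that a "Preferred IO bus"-style BIOS option would refer to.
from pathlib import Path

MELLANOX_VENDOR = "0x15b3"

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    if (dev / "vendor").read_text().strip() == MELLANOX_VENDOR:
        # Entries are named like 0000:81:00.0 -> domain:bus:slot.function
        _, bus, _ = dev.name.split(":")
        print(f"{dev.name}: bus {bus}")
```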

Since this BIOS option only accepts a single value, I assume this would prevent using multiple 100G cards at the same time. Meh. You can see the effect of this BIOS option here:

[Plot: effect of the Preferred IO bus BIOS option on the EPYC system]

Finally, we also did some forwarding tests between the two systems. Here we have packetblaster (single core, 16 queue pairs) running on the Intel system and a simple L2 forwarding program (multi core, one queue pair per core/worker) running on the EPYC system, forwarding packets back to the Intel system (a conceptual sketch of such a forwarder follows at the end of this comment). We expect to be bottlenecked by packetblaster running on the Intel PCI-e Gen3 system, and we show the maximum observed packetblaster performance for that system as a grey dashed line:

[Plot: forwarding rate, packetblaster on the Intel system vs. L2 forwarding on the EPYC system]

  • Forwarding rate seems to track packetblaster rate, albeit with an offset. It could be that the packetblaster port has some extra overhead receiving the forwarded packets, and that we'd see slightly better performance if these were two PCI-e Gen4 systems.
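
To be clear about what "simple L2 forwarding program" means above: the benchmarked forwarder is a Snabb program with one queue pair per worker. The following is only a conceptual sketch of the same idea using Linux AF_PACKET sockets; the interface names are placeholders and this script is nowhere near the benchmarked rates:

```python
# Conceptual sketch only: receive Ethernet frames on one interface and
# retransmit them unchanged on another (requires root, Linux AF_PACKET).
# The benchmarked forwarder is a Snabb program, not this script, and the
# interface names here are placeholders.
import socket

ETH_P_ALL = 0x0003

def open_raw(ifname):
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    s.bind((ifname, 0))
    return s

rx = open_raw("eth_in")    # placeholder: port facing the packetblaster
tx = open_raw("eth_out")   # placeholder: port sending back to the Intel system

while True:
    frame = rx.recv(2048)  # one frame per call
    tx.send(frame)
```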
@eugeneia (Member Author):

We have a branch that adds support for the CQE compression feature of current ConnectX cards (eugeneia@0ea878f).

We can see that the feature works by observing PCI-e PMU counters, which confirm that the NIC issues significantly fewer PCI-e writes (to write back CQEs to the driver) when compression is enabled.

[Plot: PCI-e PMU write counters with and without CQE compression]
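
For a rough intuition of why this shows up in the PMU counters, here is a back-of-envelope estimate. It assumes 64-byte regular CQEs and small mini-CQEs packed into a compressed CQE; the exact sizes and packing are assumptions for illustration, not taken from the driver or the PRM:

```python
# Back-of-envelope estimate of CQE write-back traffic with and without CQE
# compression. Sizes are assumptions for illustration (64 B regular CQEs,
# 8 B mini-CQEs packed into 64 B compressed CQEs), not driver constants.

CQE_SIZE = 64        # bytes per regular completion entry (assumed)
MINI_CQE_SIZE = 8    # bytes per mini-CQE in a compressed session (assumed)
MINI_PER_CQE = CQE_SIZE // MINI_CQE_SIZE   # 8 completions per 64 B write

def cqe_writeback_bytes(completions, compressed):
    if not compressed:
        return completions * CQE_SIZE
    # Assumed model: each 64 B write carries MINI_PER_CQE completions as
    # mini-CQEs (ignoring the session's title CQE for simplicity).
    sessions = -(-completions // MINI_PER_CQE)   # ceiling division
    return sessions * CQE_SIZE

for n in (8, 64, 1024):
    print(f"{n:>5} completions: {cqe_writeback_bytes(n, False):>7} B plain "
          f"vs ~{cqe_writeback_bytes(n, True):>6} B compressed")
```

By this crude model the completion write-back traffic shrinks by roughly the mini-CQE packing factor, which is consistent with the direction of the counter change, even though it does not (yet) translate into a higher receive rate in our tests.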

We were however not yet able to see this improve receive performance:

[Plots: source→sink receive benchmark with CQE compression in balanced and aggressive modes]
