Performance observations when running NFs on local and/or remote NUMA node in a server #285
Comments
To add to this: can you please give any suggestions on how to solve this problem, and also on how I can send packets to the corresponding rx_mgr buffer of a thread based on the port->id (0 or 1) on which the packets arrive, so that port-0 traffic goes to rx-buffer-0 and port-1 traffic goes to rx-buffer-1, and the corresponding action is then performed on each packet? Thanks
Hi @Balaram6712, I'm an ONVM user as well, so the following comments may not be fully correct.
True, the remote node has no direct attachment to the standalone NIC, so NUMA distance matters. Because DPDK bypasses the kernel, packets arrive directly in the LLC, in your case Node-0's LLC. So when Node-1 wants to process packets, it must retrieve them from Node-0, which hurts performance. This guess can be verified via PCM: could you please check how the L3 cache hit rate and memory bandwidth change between the 2-local and 1-local-1-remote configurations? (I don't have any NUMA server 😭)
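For what it's worth, DPDK can report which NUMA socket a port is attached to, which is a quick way to confirm the local/remote layout before diving into PCM. A minimal sketch using standard DPDK calls (the function name is made up for illustration and is not ONVM code):

```c
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Print whether each port is NUMA-local to the core running this check. */
static void
check_port_numa_locality(uint16_t nb_ports) {
        uint16_t port;
        int core_socket = (int)rte_socket_id();

        for (port = 0; port < nb_ports; port++) {
                int dev_socket = rte_eth_dev_socket_id(port);
                /* rte_eth_dev_socket_id() returns -1 if the socket is unknown */
                printf("port %u: NIC on socket %d, core on socket %d -> %s\n",
                       port, dev_socket, core_socket,
                       dev_socket == core_socket ? "local" : "remote");
        }
}
```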
I think the problem is resource contention; please inspect the PCM metrics. BTW, are you sure the NFs are not overloaded? (i.e., 0 rx_drop in all your cases?)
I tried to alter it, but I don't think that explains both NFs getting 4.5 Gbps each.
In my opinion, the easiest way is to add another thread there. Have a nice weekend & experiment! 💪
@Balaram6712 - Thanks for your questions and sorry we are a little slow to respond right now.
The changes you have made are good. Make sure that you have
In general we have not looked in detail at NUMA issues for ONVM. Ideally you want to keep packets fully processed on one socket. To do that you will need 2 RX threads (one per socket) and 2 TX threads (one per socket). I don't recall if our manager core assignment logic is smart enough to try to distribute those across sockets. My guess is that right now the TX thread is becoming the bottleneck, as it is getting packets from both the local and remote nodes to send out. You may want to start a second TX thread and check that it is assigned to handle the NFs on the second NUMA socket; a sketch of how lcores could be picked per socket is below. @JackKuo-tw - yes, you are correct.
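To illustrate the one-RX/TX-thread-per-socket idea, here is a hedged sketch of how an lcore on a given socket could be picked at launch time. This is not existing ONVM manager code; it only uses standard DPDK lcore APIs (on DPDK releases older than 20.11, RTE_LCORE_FOREACH_WORKER is spelled RTE_LCORE_FOREACH_SLAVE), and tx_thread_main/tx_mgr0/tx_mgr1 stand in for the manager's real thread entry points and arguments:

```c
#include <rte_lcore.h>
#include <rte_launch.h>

/* Illustrative helper: find an idle worker lcore on the given NUMA socket
 * so that one RX thread and one TX thread can be started per socket. */
static int
find_lcore_on_socket(unsigned int socket_id) {
        unsigned int lcore_id;

        RTE_LCORE_FOREACH_WORKER(lcore_id) {
                if (rte_lcore_to_socket_id(lcore_id) == socket_id &&
                    rte_eal_get_lcore_state(lcore_id) == WAIT)
                        return (int)lcore_id;
        }
        return -1; /* no idle lcore on this socket */
}

/* Usage sketch: launch one TX thread per socket.
 *
 *   int c0 = find_lcore_on_socket(0);
 *   int c1 = find_lcore_on_socket(1);
 *   if (c0 >= 0) rte_eal_remote_launch(tx_thread_main, tx_mgr0, c0);
 *   if (c1 >= 0) rte_eal_remote_launch(tx_thread_main, tx_mgr1, c1);
 */
```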
@JackKuo-tw Thanks for the reply, I will get back to you on that. Yes, for the remote node the performance drops; what I am trying to solve here is that when one NF is moved to the remote node, it affects the performance of the local NF as well.
Yes, I observed rx_drop = 0 in all the cases.
I printed the receive counts and observed that only Thread-1 is receiving packets from ports 0 and 1 and storing them in its own buffers; Thread-0 shows a value of zero every time.
I also observed this by running pqos on the system: only 1 RX thread seems to be active, which is consistent with Thread-0 not receiving packets.
I added the new function as suggested.
@twood02 Thanks for the reply,
I added an extra TX thread as well, assigned each TX thread to a corresponding NF, and observed an improvement in throughput.
But the problem still exists: the performance of the local NF is affected when the other NF is moved to the remote node. I guess the problem is still around the multiple RX threads, where only Thread-1 receives packets from both ports. Can you suggest how to solve this so that both threads can receive packets from both ports and I can steer the packets accordingly?
Can you suggest how we can do this so that an RX thread collects packets only from the port it is assigned to? Thanks
I can respond to other parts later, but quickly:
This should be an easy change in the rx_thread for loop. Currently it iterates through all ports and then bursts from a queue based on the RX thread ID number. Instead you could have a for loop over all queues (possibly you only have 1 queue, especially if you are using virtual NICs), and then change the rte_eth_rx_burst call so that each RX thread reads from its own port rather than its own queue; see the sketch below.
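A hedged sketch of that change (the exact manager code may differ; PACKET_READ_SIZE and the dispatch step are stand-ins, and the sketch frees packets only to stay self-contained):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define PACKET_READ_SIZE 32 /* burst size; stand-in for the manager's constant */

/* Before (paraphrased): every RX thread polled every port, using its own
 * thread id as the queue number:
 *
 *   for (i = 0; i < ports->num_ports; i++)
 *           rx_count = rte_eth_rx_burst(ports->id[i], rx_mgr->id,
 *                                       pkts, PACKET_READ_SIZE);
 *
 * After: each RX thread owns one port and polls that port's queues. */
static void
rx_poll_own_port(uint16_t port_id, uint16_t nb_queues) {
        struct rte_mbuf *pkts[PACKET_READ_SIZE];
        uint16_t q, rx_count, k;

        for (q = 0; q < nb_queues; q++) {
                rx_count = rte_eth_rx_burst(port_id, q, pkts, PACKET_READ_SIZE);
                if (rx_count == 0)
                        continue;
                /* In the manager, packets would be handed to this thread's
                 * rx_mgr buffer here. */
                for (k = 0; k < rx_count; k++)
                        rte_pktmbuf_free(pkts[k]);
        }
}
```

With one RX thread per port, port-0 traffic lands only in rx-buffer-0 and port-1 traffic only in rx-buffer-1, which also addresses the earlier steering question.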
After the changes you suggested on assigning an RX thread to a port, I observe full throughput on the local node and a max of 6.5 Gbps on the remote node, but there is a problem when running one NF on the local node and the other on the remote node: the local NF throughput is affected. For example, when I send 9 Gbps to each NF, placed on the local and remote node respectively, the local NF throughput is 7 Gbps and the remote NF is 6.5 Gbps, which is not expected, as the local NF should produce full throughput. I presume this is due to only 1 queue being in use. I tried to debug by adding some print statements of port_id and packet counts. I know it's been a long time looking into this; any help here would be very much appreciated.
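If the single RX queue is indeed the limiter, one option is to configure each port with two RX queues and RSS so that both RX threads can pull from the same port. A minimal sketch, assuming DPDK 18.x-era constant names and that nb_rxd and mbuf_pool come from the surrounding init code (this is an illustrative helper, not ONVM code):

```c
#include <string.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Configure a port with two RX queues spread by RSS.
 * (TX queue setup is omitted for brevity.) */
static int
setup_port_two_rx_queues(uint16_t port_id, uint16_t nb_rxd,
                         struct rte_mempool *mbuf_pool) {
        struct rte_eth_conf conf;
        uint16_t q;
        int ret;

        memset(&conf, 0, sizeof(conf));
        conf.rxmode.mq_mode = ETH_MQ_RX_RSS;           /* hash flows across queues */
        conf.rx_adv_conf.rss_conf.rss_hf = ETH_RSS_IP; /* hash on IP fields */

        ret = rte_eth_dev_configure(port_id, 2 /* rx */, 2 /* tx */, &conf);
        if (ret != 0)
                return ret;

        for (q = 0; q < 2; q++) {
                ret = rte_eth_rx_queue_setup(port_id, q, nb_rxd,
                                             rte_eth_dev_socket_id(port_id),
                                             NULL, mbuf_pool);
                if (ret != 0)
                        return ret;
        }
        return 0;
}
```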
I want to be sure I fully understand your setup now. Is this all accurate?
What exactly do you mean by local/remote here? You should have each NF get packets from its "local" RX thread (i.e., the one on the same NUMA node).
Yes
The 2 ports are connected to the local NUMA node.
Yes
The local NF gives 7 Gbps and the remote NF gives 6.5 Gbps. Our setup is such that we have 2 NUMA nodes; the OpenNetVM manager runs on the local node, which is connected to the 2 ports. Each NF is placed on either the local or the remote NUMA node.
Have a look at the detailed setup explanation as stated before. Thank you!!
@Balaram6712 Thanks for the clarification. When you say:
I assume you mean that the main ONVM manager and the first RX thread are started on the local node, but that there is also a manager RX thread running on the remote node. Are your NFs sending the packets back out or dropping them after doing some processing? The bottleneck could be the TX thread(s) if the NFs are sending packets out the port.
No, the ONVM manager and both RX threads run on the local node. The 2 RX threads collect packets from the 2 ports respectively; the RX thread that corresponds to the NF on the remote node collects packets and sends them to that NF. The overhead of moving packets from the local to the remote node causes the lower throughput when an NF is placed on the remote node, but this should not affect the local NF's throughput.
I am sending the packets out through the port after the NF processes them.
There are 2 TX threads, one per NF, which handle the packets and send them out through the port. I don't think this will be a bottleneck, since we assign a dedicated thread to handle the packets of each NF, yet the NF on the local node still yields lower throughput. If the bottleneck were the TX threads sending packets out the port, we would also see it when both NFs are placed on the local node, but in that scenario I observe full throughput from both NFs. So I don't think the problem is the TX threads or sending packets out the port; I guess it is the RX thread and queue that cause the NF on the local node to produce lower throughput in this scenario. Please let me know if I have conveyed my problem correctly. Thank you!!
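One way to settle whether the RX or TX side is losing packets is to dump each port's NIC-level counters: imissed counts packets dropped because the RX queues were full, while oerrors counts failed transmits. A sketch using standard DPDK counters, not ONVM-specific code:

```c
#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Print the NIC-level counters for one port. */
static void
dump_port_stats(uint16_t port_id) {
        struct rte_eth_stats stats;

        if (rte_eth_stats_get(port_id, &stats) != 0) {
                printf("port %u: stats unavailable\n", port_id);
                return;
        }
        printf("port %u: rx=%" PRIu64 " tx=%" PRIu64
               " rx_missed=%" PRIu64 " tx_errors=%" PRIu64 "\n",
               port_id, stats.ipackets, stats.opackets,
               stats.imissed, stats.oerrors);
}
```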
Respected sir,
Please find below our observations and a couple of queries on how we can improve throughput in such cases.
Experimental setup:
We run OpenNetVM on a two-node (NUMA) server with 24 cores (12 cores per node), 96 GB of memory, and one dual-port 10G NIC. We use DPDK Pktgen to generate traffic; it runs on another server with the same configuration. Both servers run Ubuntu 18.04. Both NIC ports are located on Node-0 (we call it the local node); Node-1 is the remote node. The two servers are connected back-to-back using the dual-port 10G NICs.
On top of this setup we run the following experiments, with the descriptor ring sizes set to:
#define RTE_MP_RX_DESC_DEFAULT 8192
#define RTE_MP_TX_DESC_DEFAULT 8192
Query 1: Could you suggest any other approaches to get full throughput, i.e., 18 Gbps (for a traffic rate of 18 Gbps), when an NF runs on the local node?
Query 2: When we run two NFs on the local node, we observe 7 Gbps throughput for each NF, whereas when we run one NF on the local node and another NF on the remote node, we observe 4.5 Gbps for each. It is surprising that the throughput of the NF running on the local node also drops to 4.5 Gbps, instead of the 7 Gbps we observed when both NFs were running on the local node. Could you please suggest any changes to improve throughput?
Thanks