Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto IP function duplicates IP address #1139

Closed
shaxxx opened this issue Jul 24, 2023 · 20 comments
Closed

Auto IP function duplicates IP address #1139

shaxxx opened this issue Jul 24, 2023 · 20 comments

Comments

@shaxxx
Copy link
Contributor

shaxxx commented Jul 24, 2023

I have my supernode nice and running on Windows, with about 100 communities in community list.
Each community has subnet defined in community list, ie.

example1 10.0.10.0/24 #EXAMPLE 1
example2 10.0.10.0/24 #EXAMPLE 2

Each community has one server available 24/7, on predefined IP address set by edge client (ie. 10.0.10.1)
Other clients are dynamic, and don't have IP specified - they get it from supernode in range specified in community.list
Each edge sends it's machine name with -I option.
It's all was working perfectly, until recent problem when one of edge clients, let's call it XXX, in one specific community would get IP from supernode (ie. 10.0.10.15) but connection would not work and ipconfig /all would show problematic IP address marked as DUPLICATE.
Looking at output from supernode managment port, there is no clients with that IP.
If I keep disconnecting and connecting, edge client would eventually get some other IP and everything would work fine.
Currently I worked around it by randomizing information sent with -I option.

Now, this is where things get really interesting.
On this same subnet/community I have another client, lets call it YYY that usually gets IP allocated really close to the one from problematic client (IE. 10.0.10.16 or 10.0.10.17). And I believe this is the origin of the problem.
Whenever I check managment port it shows YYY connected, but NOT with actual IP address used in community subnet.
For example, not it shows 10.0.10.17, while actual address on TAP adapter of YYY is 10.0.10.16.
I can't ping 10.0.10.17 from any edge client, but everything is fine with 10.0.10.16
Keep in mind, YYY is connected every morning, has no issues at all, can work with server on 10.0.10.1 just fine all day.
And I can see it on managment port connected all the time but with invalid IP address (usually with offset +/- of one).

Conclusion, there is list of potential problems

  • auto dhcp fails to check for existing IP address properly
  • auto dhcp checks for existing IP properly but existing IP list of edges is not valid for some reason
  • managment is providing false data
  • I'm an idiot ?
@Logan007
Copy link
Collaborator

Logan007 commented Jul 24, 2023

Hey, thank you for describing the issue you discovered. Do you only use one supernode or do you use a federated supernode setup?

We do not use real DHCP but some other simple algorithm. As you pointed to the name string, that's exactly what we hash into an IP address, and go up/downwards in case it is already taken. This does not work in a distributed way yet.

@shaxxx
Copy link
Contributor Author

shaxxx commented Jul 24, 2023

No, I'm using single supernode, but with multiple communities, and like I said every community defines it's subnet in community.list and this is the first one with the problem (more than 100 communities).

Yeah, I did went trough PRs and code to see it's Pearson hash, but didn't really had any idea how to properly debug it since this runs in production and I can't simply take it offline for the sake of debugging. Also, I've tried verbose debug output on the client, but couldn't find anything that sticks out of normal.

If you need more info let me know.

@Logan007
Copy link
Collaborator

I currently am a bit far away from the code, so I am guessing here until I get another chance to take a deeper look. But, what comes to mind is MAC address. As far as I remember, MAC address is assigned randomly if not specified by -m. Maybe some other computer remembers same IP address having previous MAC and communicates a duplicate or so? (Thinking of some ARP issue, but again, not sure here).

So, one thing I would try for testing is a fixed MAC address using -m at that edge.

@Logan007
Copy link
Collaborator

Logan007 commented Jul 24, 2023

Note to myself: also check the code for gratuitous ARP sent by edges with auto-IP on.

@shaxxx
Copy link
Contributor Author

shaxxx commented Jul 24, 2023

I will try with fixed mac address, but keep in mind I already tried removing all TAP adapters multiple times, as well as reinstalling TAP driver which changed MAC address of used adapter multiple times.
Also I was so desperate and restarted (shutodown, then start) supernode once, and it helped, but only until the next morning.
My supernode conf is

-p=7654
-c community.list
-F cloud
-M

@shaxxx
Copy link
Contributor Author

shaxxx commented Jul 24, 2023

I don't think it's ARP problem.
Edge client is connected for almost entire day, problem is it get's one address from supernode (10.0.20.16) while in the same time management port reports it's using another one (10.0.20.17). There is no 10.0.20.16 in management port info, while it says 10.0.20.17 is connected. If I kill the edge management port will stop listing 10.0.20.17 as online (time will not be updated to current anymore).
If I reconnect, it's the same thing, gets one address, reports the other in the management port.
Here's the non verbose log from client

24/Jul/2023 06:12:36 [edge_utils.c:3236] adding supernode = n2n.XXXX.XXX:7654
24/Jul/2023 06:12:36 [edge.c:1029] WARNING: switching to AES as key was provided
24/Jul/2023 06:12:36 [edge.c:1069] starting n2n edge FIXME Tue Jun 6 20:43:42 CEDT 2023
24/Jul/2023 06:12:36 [edge.c:1075] using compression: none.
24/Jul/2023 06:12:36 [edge.c:1076] using AES cipher.
24/Jul/2023 06:12:36 [edge_utils.c:392] number of supernodes in the list: 1
24/Jul/2023 06:12:36 [edge_utils.c:394] supernode 0 => n2n.XXXX.XXX:7654
24/Jul/2023 06:12:36 [edge_utils.c:483] successfully created resolver thread
24/Jul/2023 06:12:36 [edge.c:1106] automatically assign IP address by supernode
24/Jul/2023 06:12:36 [edge.c:1178] send REGISTER_SUPER to supernode [n2n.XXXX.XXX:7654] asking for IP address
24/Jul/2023 06:12:36 [edge.c:1189] received REGISTER_SUPER_ACK from supernode for IP address asignment
Open device [name={48A2FBDB-162F-4DA7-AA22-2F78A96E910A}][ip=10.0.20.16][ifName=OpenVPN][MTU=1290][mac=00:FF:48:A2:FB:DB]
24/Jul/2023 06:12:36 [edge.c:1215] created local tap device IP: 10.0.20.16, Mask: 255.255.255.0, MAC: 00:FF:48:A2:FB:DB
24/Jul/2023 06:12:36 [edge.c:1308] edge started
24/Jul/2023 06:12:36 [edge_utils.c:1169] successfully joined multicast group 224.0.0.68:1968
24/Jul/2023 06:12:37 [edge_utils.c:2569] [OK] edge <<< ================ >>> supernode

@Logan007
Copy link
Collaborator

Okay, thanks for the report then. I will have a look when I get to it, more likely weeks than days. Did I understand correctly that it normally would be assigned the 15, and 16 already is the counted-up address?

@shaxxx
Copy link
Contributor Author

shaxxx commented Jul 25, 2023

I think it's actually count down address.
Had time to bring up local supernode with same community.list and two edge nodes from virtual machine with same information used in -I option. I was able to reproduce that pearson hash for both client is the same 10.0.20.17 (before counting down or up).
But did not had time to debug any further. I can send you client info strings if you want to reproduce it localy to test.

@shaxxx
Copy link
Contributor Author

shaxxx commented Jul 26, 2023

Good news, I was able to get 100% reproducibility on local machine.
Steps to reproduce:

  • run supernode with supernode.conf and community.list attached in zip
  • run 1st edge (let's call it ZUPKA-KLIJENT) with zupka-klijent.conf attached in zip, edge will get IP 10.0.20.17 from supernode
  • run 2nd edge (let's call it SERVER2022) with server2022.conf attached in zip. Pearson hash on supernode will calculate 17 as first proposal, but ip_addr_available function will detect existing edge with same IP, so after going one downwards 2nd edge will get IP 10.0.20.16. All good so far.
  • kill 1st edge (ZUPKA-KLIJENT) with double CTRL+C
  • run 1st edge (ZUPKA-KLIJENT) again. Supernode will detect existing 10.0.20.17 and 10.0.20.16 edge clients, and after going downwards two times from original Pearson hash of 17, 1st edge (ZUPKA-KLIJENT) will get 10.0.20.15
  • list connected edges from supernode management port, you will see two edge clients, SERVER2022 with IP address 10.0.20.16 (correct), and ZUPKA-KLIJENT with IP address 10.0.20.17. (INCORRECT)
  • kill 2nd edge (SERVER2022) with double CTRL+C
  • run 2nd edge (SERVER2022) again. Supernode will detect existing 10.20.0.17 and 10.0.20.16 edge clients, but will not detect 10.0.20.15. So after after going downwards two times from original Pearson hash of 17, 2nd edge (SERVER2022) will get also 10.0.20.15 IP address and windows ipconfig will mark it correctly as duplicate.
  • leave both edges and supernode running for few minutes and list connected edges from management port. You'll notice that ZUPKA-KLIJENT is still online, but reports invalid IP address 10.0.20.17 instead of actual working 10.0.20.15

Conclusion: supernode does not takes in account changes in IP address in the list of connected edges which leads to wrong information on managment port and duplicated IP's on edge clients.

You will also find verbose logs from supernode and both edge clients (first and second start) attached in the zip file.

debug.zip

duplicate

@Logan007
Copy link
Collaborator

Logan007 commented Jul 26, 2023

Thank you very much for the detailed description! It will be an extremely good starting point for debugging. It could be related to the edge not properly de-registering (does it work better when you wait with restarting until the edge data is purged at supernode, around 90 to120 seconds after you quit the edge?) but also definitely includes an additional underlying bug.

I will take care of it as soon as I can, not before in three weeks though.

@shaxxx
Copy link
Contributor Author

shaxxx commented Jul 28, 2023

Currently when existing edge is matched by mac address there is additional check if socket has change, and if it has, then updates existing edge with new socket info.

Following this logic, I think it's safe in this case to update it's device address also.
I added one line after socket update

memcpy(&(scan->dev_addr), &(reg->dev_addr), sizeof(n2n_ip_subnet_t));

I tested again, no duplicate IP's, and management port displays valid info.
But you'll know better if my logic is reasonable enough for other use cases.

@Logan007
Copy link
Collaborator

Looks extremely good to me, the update of the IP address will help auto-logic to always have an up-to-date list of actually occupied IP addresses. I currently do not remember how we transmit and handle not-automatically assigned IP addresses I and hope it does not conflict, but yes, your approach looks extremely reasonable. Only thing, not sure if it needs some checks before copying.

We obviously never thought of this case (IP address of node changed) in the update path. Good catch!

@Logan007
Copy link
Collaborator

Do you want to test for a while and then provide a pull request?

@shaxxx
Copy link
Contributor Author

shaxxx commented Aug 1, 2023

I just tried to pull the latest changes to make PR and it seems this was fixed by you 2 years ago.
Thought I was going crazy, I forked the repo on April 27th, and there is no IP address copy part in my fork
But in current, repo, there it is, just like it's supposed to be.
Not sure how this is possible, Git blame says it's2 years old.

@Logan007
Copy link
Collaborator

Logan007 commented Aug 1, 2023

So the bug has already been fixed? Does that revised version work for you?

@shaxxx
Copy link
Contributor Author

shaxxx commented Aug 1, 2023

Nah, I was looking at the wrong place, IP address info was updated in the part of code run when new edge is connected.
I've simply copied that part to the place when socket change was detected.
I'll push it to production tonight, and give it a week or so to see if there's any problems, and then come back to you.
Thank you for your support!

@Logan007
Copy link
Collaborator

Logan007 commented Aug 1, 2023

Thank you for actively contributing to development!

@hamishcoleman
Copy link
Collaborator

Thanks from me too.

@Logan007 - should we backport this bugfix to the stable branch?

@shaxxx shaxxx changed the title Auto DHCP function duplicates IP address Auto IP function duplicates IP address Aug 1, 2023
@Logan007
Copy link
Collaborator

Logan007 commented Aug 1, 2023

Sounds reasonable, as far as I can see it does not break compatibility.

@shaxxx
Copy link
Contributor Author

shaxxx commented Aug 28, 2023

Fixed with 1142

@shaxxx shaxxx closed this as completed Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants