fix: port collision on large amount of tests #257
Conversation
This seems a bit hacky. Why not just do something like bind to port zero? The OS will then assign you an unused port, which you can find using `local_addr`.
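The suggestion above can be sketched with std-only Rust (illustrative code, not the crate's actual implementation):

```rust
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Binding to port 0 asks the OS to pick any unused ephemeral port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    // local_addr() reveals which port was actually assigned.
    let port = listener.local_addr()?.port();
    println!("assigned port: {}", port);
    Ok(())
}
```

The port stays reserved only while `listener` is alive; handing the number off to a separately spawned server process reopens the gap, which is the limitation the rest of the thread digs into.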
/// Acquire an unused port and lock it for the duration until the sandbox server has
/// been started.
Was going to suggest the same thing as @DavidM-D, but it seems like this is the key difference: there's a potential race condition where two nodes see that a port is "free" at the same time, then proceed to simultaneously attempt binding to it. And since there's no global node executor that controls the issuing of ports to nodes, there's no way to gate access to ports if workspaces instances can't communicate with each other, which is probably why the common denominator used here is the filesystem.
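The check-then-bind race described above can be sketched as follows (the helper name and port number are illustrative, not from the PR):

```rust
use std::net::TcpListener;

// Hypothetical helper: does this port look free right now?
fn port_is_free(port: u16) -> bool {
    TcpListener::bind(("127.0.0.1", port)).is_ok()
}

fn main() {
    let candidate: u16 = 34567; // arbitrary example port
    // Two test processes can both observe the same port as free...
    let a_sees_free = port_is_free(candidate);
    let b_sees_free = port_is_free(candidate);
    // ...but only one of them can actually bind it afterwards.
    if a_sees_free && b_sees_free {
        let winner = TcpListener::bind(("127.0.0.1", candidate));
        if winner.is_ok() {
            // While the winner holds the socket, a second bind fails:
            // the "free" observation above was already stale.
            assert!(TcpListener::bind(("127.0.0.1", candidate)).is_err());
        }
    }
}
```

Without a shared coordinator between test processes, the window between the observation and the bind is exactly where the collision happens, and the filesystem is the only common ground left to serialize the claim.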
let lockfile = File::create(lockpath).map_err(|err| {
    ErrorKind::Io.full(format!("failed to create lockfile for port {}", port), err)
})?;
if lockfile.try_lock_exclusive().is_ok() {
Curious what `create` does to a file that currently exists and is locked. The normal behaviour is to truncate existing files. Does it error at that point, forcing a failed return? Would it truncate the file (invalidating the previous lock, though that's unlikely), causing this new lock attempt to fail and a different port to be tried? Or would it skip truncation yet still return a handle to the file on which we can attempt to secure a lock?
I assume you've probably tested this.
So lock files are purely advisory about a resource being taken up. They don't actually lock the file from being written to, so truncation wouldn't error out.
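That behaviour is easy to confirm with std alone: `File::create` happily truncates a file that another handle still has open, because nothing guards the file contents themselves (the temp-file name here is illustrative):

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("advisory_lock_demo.tmp");
    let mut first = File::create(&path)?;
    first.write_all(b"held by first handle")?;
    // A second create() on the same path truncates without error, even
    // though the first handle is still open: nothing protects the bytes.
    let second = File::create(&path)?;
    assert_eq!(second.metadata()?.len(), 0);
    std::fs::remove_file(&path)?;
    Ok(())
}
```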
Gotcha!
@ChaoticTempest does it also fix #253? One of the proposed solutions there was a timeout between requests and reducing the number of threads, but that was too slow for running the tests.
@ChaoticTempest As our research has shown, that statement is incorrect. This is typical for instances with a large amount of resources, namely a high number of CPU cores, which results in a large number of tests running at the same time. And since the ports are selected randomly, and the randomizer's entropy is not of high quality, the probability of choosing an already-open port increases dramatically. More correct:
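The collision rate grows birthday-problem style with the number of concurrent tests; a quick estimate (the port-range size and test count below are illustrative assumptions, not measured values):

```rust
fn main() {
    // Probability that n tests picking uniformly from m candidate ports
    // all avoid each other; 1 minus that is the collision probability.
    // (Weak RNG entropy, as noted above, makes reality even worse.)
    let m = 10_000.0_f64; // size of the candidate port range (assumed)
    let n = 200;          // number of concurrently starting tests (assumed)
    let mut p_all_distinct = 1.0_f64;
    for k in 0..n {
        p_all_distinct *= (m - k as f64) / m;
    }
    println!("P(collision) ~ {:.3}", 1.0 - p_all_distinct);
    // With these numbers the collision chance is already roughly 86%.
}
```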
This one doesn't fix that issue in particular. That still needs to be resolved in near/nearcore#8328
ahh, I was generalizing for the whole problem, including the issue I mentioned above. But this one just fixes the fact that the RNG chooses very similar ports. The other issue should try to alleviate the patching-state issues so that you can run this on as many threads as possible without hitting that problem.
Port collisions would rarely happen in general, but they occur pretty frequently on a machine that doesn't have enough resources to run a lot of nodes. This fix remedies the problem by adding lockfiles to the ports until the server actually runs and fully acquires the port.
Not sure if this is the best way of achieving it, so feel free to suggest anything different.
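A dependency-free sketch of the idea (the PR itself takes an advisory exclusive lock on the lockfile via a locking crate; here atomic `create_new` stands in for that, and all paths and names are illustrative):

```rust
use std::fs::OpenOptions;
use std::path::PathBuf;

// Try to claim `port` by atomically creating a lockfile; create_new fails
// if the file already exists, so only one process wins the claim.
fn try_claim_port(port: u16) -> Option<PathBuf> {
    let path = std::env::temp_dir().join(format!("sandbox-port-{}.lock", port));
    OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(&path)
        .ok()
        .map(|_| path)
}

fn main() {
    let port = 34567; // arbitrary example port
    // Clear any stale lockfile from a previous run of this demo.
    let _ = std::fs::remove_file(
        std::env::temp_dir().join(format!("sandbox-port-{}.lock", port)),
    );
    let lock = try_claim_port(port).expect("first claim should win");
    // A second claim for the same port loses until the lockfile is removed,
    // i.e. once the server has fully bound the port.
    assert!(try_claim_port(port).is_none());
    std::fs::remove_file(&lock).unwrap();
}
```

One known trade-off of the `create_new` stand-in is that a crashed process leaves a stale lockfile behind, which is a reason to prefer advisory OS locks (released automatically when the handle closes) as the PR does.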