Add many files in batches #66
Conversation
I think this is about the size of the command line, not the number of arguments? i.e., you could still get the same error. If that's indeed the case, would it make sense to run the add in batches?
That makes sense to me.
Made changes to base this on ARG_MAX.
_run(git(f"add -f {' '.join(shlex.quote(f) for f in included)}"))
# Get the command line length limit, if available
try:
    cl_limit = int(_run("getconf ARG_MAX")) // 16
Why are you dividing by 16 here?
This is what worked, experimentally. ARG_MAX delivers a value much higher than what actually works. Dividing by 16 is where that changes (dividing by 15 still causes the error).
Mmm, we should probably figure out why. Does _run run the given command in a shell? Does the shell have a different limit, maybe? Is there a way to get that limit, if so?
I did some research that wasn't very fruitful, but will do some more and let you know if I find anything useful.
_run runs the command in a pexpect.spawn call, which calls PtyProcess.spawn, which calls os.execv or os.execvpe, which essentially just calls the execve system call. So it seems to be an OS-level limit, not anything related to how _run is written. This is also supported by an experiment in which I generated commands and then ran them directly in an Ubuntu terminal.
Since the value that actually breaks Ubuntu command calls is (seemingly) arbitrary and not directly related to the value from getconf ARG_MAX (adding or subtracting a few characters after the division by 16 doesn't have an effect on whether the command can run), and this problem probably isn't due to the size of the environment (dividing ARG_MAX by 15 should certainly leave room for even really long environments, and ours, in testing, is much smaller than that would allow for -- I checked), I doubt the value of ARG_MAX, at least on Ubuntu systems. I propose picking an arbitrary integer here and sticking to it (maybe 32768 to be safe?).
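Incidentally, the same limit can be queried from Python without shelling out to getconf; a minimal sketch, assuming a POSIX system (the 4096 fallback is the POSIX-guaranteed minimum, not a value from this PR):

```python
import os

def arg_max_bytes(fallback=4096):
    """Return the kernel's ARG_MAX in bytes, or the POSIX minimum on failure."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (ValueError, OSError):
        # sysconf name unknown or query failed: fall back conservatively
        return fallback
```

This avoids the subprocess round-trip entirely, though it reports the same (apparently optimistic) number getconf does.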
This site literally recommends calling exec with longer and longer argument lists until it fails as the most reliable way. Yikes. Environment variables also count towards ARG_MAX.
https://www.in-ulm.de/~mascheck/various/argmax/
POSIX defines the minimum ARG_MAX to be 4096, so we could just use that conservatively?
Also, this limit is in bytes, right? That might be the reason you are dividing by 16: the average length of the paths you are using is ~16 characters. That makes things a little more complicated to split (the length of the arguments now matters).
except:
    cl_limit = None

# If no CL limit available, add all the files
Wouldn't an easier way to do this just be to have a function that breaks a list up into chunks of a given size?
e.g.
import itertools

def chunked(xs, n):
    it = iter(xs)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
although you can actually shorten this to:
def chunked(xs, n):
    """Split xs into chunks of size n."""
    it = iter(xs)
    return iter(lambda: tuple(itertools.islice(it, n)), ())
(you can also get a version of this function from the more_itertools package)
In either case, you then just iterate over the chunks directly. Something like:
try:
    arg_limit = ...
except Exception:
    arg_limit = len(included)
for chunk in chunked(included, arg_limit):
    _run(git(f"add -f {' '.join(shlex.quote(f) for f in chunk)}"))
Although, as said above, I think ARG_MAX refers to bytes, not the number of arguments, so chunked would have to be a bit more complex. Something like:
def chunked(xs, n):
    """Split xs into chunks whose total size in bytes stays within n."""
    chunk = []
    nbytes = 0
    for x in xs:
        xlen = len(x.encode("utf-8"))
        if xlen > n:
            raise ValueError("Length of string is larger than chunk size")
        if nbytes + xlen > n:
            yield chunk
            nbytes = 0
            chunk = []
        chunk.append(x)
        nbytes += xlen
    if chunk:
        yield chunk
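One wrinkle the byte-based version above doesn't capture is that shlex.quote can lengthen an argument, and a joining space sits between each pair. A sketch that counts post-quoting bytes instead (chunk_args_by_bytes is a hypothetical helper, not part of the PR):

```python
import shlex

def chunk_args_by_bytes(args, limit):
    """Group shell-quoted arguments so each joined group stays within limit bytes."""
    chunk, nbytes = [], 0
    for arg in args:
        quoted = shlex.quote(arg)
        qlen = len(quoted.encode("utf-8"))
        if qlen > limit:
            raise ValueError(f"single argument exceeds limit: {arg!r}")
        # one byte for the space joining this argument to the previous one
        cost = qlen + (1 if chunk else 0)
        if nbytes + cost > limit:
            yield chunk
            chunk, nbytes = [], 0
            cost = qlen  # first element of a new chunk needs no separator
        chunk.append(quoted)
        nbytes += cost
    if chunk:
        yield chunk
```

For example, with a 15-byte limit, ["a b.txt", "c.txt", "d.txt"] quotes to "'a b.txt' c.txt" (exactly 15 bytes) plus a second group of just "d.txt".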
This is a solution for #40 using the batching suggestion.
Justification:
"On a 32-bit Linux, this is ARG_MAX/4-1 (32767). This becomes relevant if the average length of arguments is smaller than 4." -- https://www.in-ulm.de/~mascheck/various/argmax/
"On my OSX 10.5 machine (and most other BSD systems) it's 262144" -- https://serverfault.com/questions/69430/what-is-the-maximum-length-of-a-command-line-in-mac-os-x
Thus, this change adds files in batches of 32,700 (a nice round number that leaves plenty of space for other arguments if needed, and is also large enough that it shouldn't slow the software excessively).
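The batching described above might look roughly like this sketch; `runner` is a stand-in for the PR's `_run`/`git` helpers, which are not reproduced here:

```python
import shlex

BATCH_SIZE = 32_700  # stays under the 32767 argument-count bound cited above

def add_in_batches(files, runner, batch_size=BATCH_SIZE):
    """Run `git add -f` over `files` in fixed-size batches via `runner`."""
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        # quote each path so spaces and shell metacharacters survive
        runner("git add -f " + " ".join(shlex.quote(f) for f in batch))
```

Passing a list's append method as `runner` makes this easy to test without touching git.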