
Add many files in batches #66

Closed · jsarchibald wants to merge 5 commits

Conversation

@jsarchibald (Author)

This is a solution for #40 using the batching suggestion.

Justification:
"On a 32-bit Linux, this is ARGMAX/4-1 (32767). This becomes relevant if the average length of arguments is smaller than 4." -- https://www.in-ulm.de/~mascheck/various/argmax/

"On my OSX 10.5 machine (and most other BSD systems) it's 262144" -- https://serverfault.com/questions/69430/what-is-the-maximum-length-of-a-command-line-in-mac-os-x

Thus, this change adds files in batches of 32,700 (a round number that leaves plenty of space for other arguments if needed, and large enough that it shouldn't slow the software excessively).
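For illustration, a minimal sketch of the count-based batching described above. It assumes included is a list of file names and reuses the _run, git, and shlex.quote helpers that appear in the patch further down; BATCH_SIZE is the 32,700 figure.

import shlex

BATCH_SIZE = 32700  # files per git-add invocation

# Add the files in fixed-size batches so that no single command
# line ever carries more than BATCH_SIZE file-name arguments.
for i in range(0, len(included), BATCH_SIZE):
    batch = included[i:i + BATCH_SIZE]
    _run(git(f"add -f {' '.join(shlex.quote(f) for f in batch)}"))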

@kzidane (Member) commented Aug 10, 2020

I think this is about the size of the command line, not the number of arguments? I.e., you could still get Argument list too long even if you provide just one argument that's long enough.

If that's indeed the case, would it make sense to run getconf ARG_MAX to get the maximum command-line size first, then batch-add the files so that the file names in each batch total <= that size?

@jsarchibald (Author)

> I think this is about the size of the command line, not the number of arguments? I.e., you could still get Argument list too long even if you provide just one argument that's long enough.
>
> If that's indeed the case, would it make sense to run getconf ARG_MAX to get the maximum command-line size first, then batch-add the files so that the file names in each batch total <= that size?

That makes sense to me.

@jsarchibald (Author)

Made changes to base this on the getconf ARG_MAX value, @kzidane.

_run(git(f"add -f {' '.join(shlex.quote(f) for f in included)}"))
# Get the command line length limit, if available
try:
cl_limit = int(_run("getconf ARG_MAX")) // 16
@kzidane (Member)

Why are you dividing by 16 here?

@jsarchibald (Author)

This is what worked, experimentally. ARG_MAX delivers a value much higher than what actually works; dividing by 16 is where that changes (dividing by 15 still causes the error).

@kzidane (Member)

Mmm we should probably figure out why. Does _run run the given command in a shell? Does the shell have a different limit maybe? Is there a way to get that limit if so?

@jsarchibald (Author)

I did some research that wasn't very fruitful, but will do some more and let you know if I find anything useful.

@jsarchibald (Author)

_run runs the command via a pexpect.spawn call, which calls PtyProcess.spawn, which calls os.execv or os.execvpe, which essentially just makes the execve system call. So it seems to be an OS error, not anything related to how _run is written. This is also supported by an experiment in which I generated commands and ran them in an Ubuntu terminal.

Since the value that actually breaks Ubuntu command calls is (seemingly) arbitrary and not directly related to the value from getconf ARG_MAX (adding or subtracting a few characters after the division by 16 has no effect on whether the command can run), and since the problem probably isn't the size of the environment (dividing ARG_MAX by 15 should handle even really long envs, and ours, in testing, is much smaller than that would allow for; I checked), I doubt the value of ARG_MAX, at least on Ubuntu systems. I propose coming up with an arbitrary integer here and sticking to it (maybe 32768 to be safe?).

@cmlsharp (Contributor)

This site literally recommends calling exec with longer and longer argument lists until it fails as the most reliable way. Yikes. Environment variables also count towards ARG_MAX.

https://www.in-ulm.de/~mascheck/various/argmax/

POSIX defines the minimum ARG_MAX to be 4096; we could just use that, conservatively?

Also, this limit is in bytes, right? That might be the reason you are dividing by 16: the average length of the paths you are using is ~16 characters. That makes things a little more complicated to split (the length of the arguments now matters).
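For what it's worth, the probing approach that page recommends could look roughly like the sketch below. probe_arg_limit is a hypothetical helper, not part of this patch: it exec's /bin/true with ever-larger argument lists until the OS refuses with E2BIG.

import errno
import subprocess

def probe_arg_limit(start=4096):
    # Double the total byte length of the argument list passed to
    # /bin/true until exec fails with E2BIG ("Argument list too long"),
    # then report the last total length that still worked.
    filler = "x" * 1024
    size = start
    while True:
        argv = ["/bin/true"] + [filler] * (size // len(filler))
        try:
            subprocess.run(argv, check=True)
        except OSError as e:
            if e.errno == errno.E2BIG:
                return size // 2  # the previous size still worked
            raise
        size *= 2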

lib50/_api.py: four outdated review threads (resolved)
except:
    cl_limit = None

# If no CL limit available, add all the files
@cmlsharp (Contributor) commented Dec 12, 2020

Wouldn't an easier way to do this just be to have a function that breaks the list up into chunks?

e.g.

import itertools

def chunked(xs, n):
    # Yield successive n-item tuples from xs.
    it = iter(xs)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

although, you can actually shorten this to:

def chunked(xs, n):
    """ Split xs in into chunks of size n """
    it = iter(xs)
    return iter(lambda: tuple(itertools.islice(it, n)), ())

(you can also get a version of this function from the more_itertools package)

In either case, then you just iterate over chunks directly. Something like

try:
    arg_limit = ...
except Exception:
    arg_limit = len(included)

for chunk in chunked(included, arg_limit):
    _run(git(f"add -f {' '.join(shlex.quote(f) for f in chunk)}"))

Although, as said above, I think ARG_MAX refers to bytes, not the number of arguments, so chunked would have to be a bit more complex. Something like:

def chunked(xs, n):
    # Yield lists of strings from xs whose total UTF-8 length
    # stays at or below n bytes.
    chunk = []
    nbytes = 0
    for x in xs:
        xlen = len(bytes(x, encoding="utf8"))
        if xlen > n:
            raise ValueError("Length of string is larger than chunk size")

        if nbytes + xlen > n:
            yield chunk
            nbytes = 0
            chunk = []

        chunk.append(x)
        nbytes += xlen
    if chunk:  # avoid yielding an empty trailing chunk
        yield chunk
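With that, the call site might look something like the following sketch. One caveat: the joining spaces and the "add -f " prefix also count toward the limit, so in practice one would pass chunked a value somewhat below cl_limit, and measure the quoted names (which is what actually lands on the command line) rather than the raw ones.

quoted = [shlex.quote(f) for f in included]
for chunk in chunked(quoted, cl_limit):
    _run(git(f"add -f {' '.join(chunk)}"))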

_run(git(f"add -f {' '.join(shlex.quote(f) for f in included)}"))
# Get the command line length limit, if available
try:
cl_limit = int(_run("getconf ARG_MAX")) // 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This site literally recommends calling exec with longer and longer argument lists until it fails as the most reliable way. Yikes. Environment variables also count towards ARG_MAX.

https://www.in-ulm.de/~mascheck/various/argmax/

POSIX defines the minimum ARG_MAX to be 4096, we could just use that conservatively?

Also, this limit is in bytes, right? That might be the reason you are dividing by 16, the average length of the paths you are using is ~16 characters. That makes things a little more complicated to split. (the length of arguments now matters).

@rongxin-liu rongxin-liu deleted the branch cs50:develop August 8, 2021 05:19
@rongxin-liu rongxin-liu closed this Aug 8, 2021