Crash when running --blob-callback on blobs larger than ~600,000,000 bytes #616

Open
relgukxilef opened this issue Dec 5, 2024 · 4 comments


@relgukxilef

Hello, I'm trying to convert certain files in my repository from one format to another. I wrote some Python code to do this and am passing it to git-filter-repo's --blob-callback argument.
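
For context, --blob-callback takes a Python body that git-filter-repo runs for every blob with blob in scope, and edits take effect by assigning to blob.data. A minimal stand-in for my callback (the transformation here is only an illustration, not the actual conversion):

# Body passed to --blob-callback; git-filter-repo supplies `blob`.
# Illustration only: upper-case small ASCII blobs instead of the real format conversion.
if len(blob.data) < 1024 and blob.data.isascii():
    blob.data = blob.data.upper()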

This seems to work for a few thousand commits, then fast-import crashes with the message "fatal: cannot truncate pack to skip duplicate: Invalid argument" and writes a fast_import_crash file.

I've tried this multiple times: with different callbacks, with the repository filtered to different paths, and with filtering by path first and then applying my blob callback in a second git-filter-repo run. The exact blob it stops at differs between runs, but it is always one much larger than the others, above 600,000,000 bytes. Perhaps this is a known or intentional limitation of git fast-import or git-filter-repo. The tail of the crash report's command history looks like this (the * marks the command fast-import was processing when it died):

  get-mark :655640
  blob
  mark :655641
  data 1507406
  blob
  mark :655642
  data 1558
  blob
  mark :655643
  data 865875
  blob
  mark :655644
* data 684724504

I have attached one such fast_import_crash file, but I have removed file and branch names, as this is a company repository.
fast_import_crash_30556.zip

@relgukxilef
Author

I have tried running git-filter-repo with --blob-callback return, and that finishes without issues. I have also tried returning early when either the input or the output of the conversion is larger than 1 MB (far smaller than the last blob listed in fast_import_crash), but it still crashes at a ~600 MB blob, even though the callback shouldn't do anything with it. Perhaps, having already updated some blobs, it fails to handle the large blob even when just passing it along? A simplified version of that size-gated callback is sketched below.
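
Roughly, the size-gated version looked like this (the conversion itself is elided; this sketch only shows the gating):

# Body passed to --blob-callback; `return` skips the blob entirely.
if len(blob.data) > 1024 * 1024:
    return  # pass large blobs through untouched
converted = blob.data  # stand-in for the real conversion
if len(converted) <= 1024 * 1024:
    blob.data = converted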

@newren
Owner

newren commented Dec 18, 2024

The error in your file is

fatal: cannot truncate pack to skip duplicate: Invalid argument

This message comes from the following git code:

static void truncate_pack(struct hashfile_checkpoint *checkpoint)
{
        if (hashfile_truncate(pack_file, checkpoint))
                die_errno("cannot truncate pack to skip duplicate");
        pack_size = checkpoint->offset;
}

and hashfile_truncate() in particular is this code:

int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
{
        off_t offset = checkpoint->offset;

        if (ftruncate(f->fd, offset) ||
            lseek(f->fd, offset, SEEK_SET) != offset)
                return -1;
        f->total = offset;
        the_hash_algo->clone_fn(&f->ctx, &checkpoint->ctx);
        f->offset = 0; /* hashflush() was called in checkpoint */
        return 0;
}

It's not clear whether it's ftruncate() or lseek() that is returning the error in your case, but the fact that you have files measuring 600 MB or more suggests you might be getting close to either a 2 GB or a 4 GB limit. If that's true, there's a possibility that switching platforms might help (due to differences in the size of long in C on different platforms). Or maybe there's some code somewhere that is using int/long/unsigned/unsigned long instead of off_t and size_t. But I don't have an easy way to reproduce.
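
As a back-of-the-envelope illustration of that guess (the amount already in the pack is an assumption, and compression is ignored): the offset handed to ftruncate() is a position within the whole pack file, so it can cross 2^31 - 1 even when no single blob comes close:

# Hypothetical numbers: cumulative pack offset vs. the signed 32-bit limit.
failing_blob = 684_724_504        # size of the blob from the crash report above
already_packed = 3 * 650 * 2**20  # assumption: bytes already written to the pack
offset = already_packed + failing_blob
print(offset > 2**31 - 1)         # True: the truncation offset is past 2 GB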

Could you report what platform you are on, and try a few other OSes? If that still fails, could you try to find a way to reproduce that others can duplicate?

@relgukxilef
Author

The error occurs on 64-bit Windows with git version 2.47.1.windows.1 and git-filter-repo version a40bce548d2c.
I have managed to reproduce the issue in a small repository. It occurs reliably when running the following script, which creates a repository in the working directory, makes a few commits with large files, and then runs git-filter-repo. Pass the path to git-filter-repo as the first command-line argument.

import subprocess, shutil, sys

# Start from a fresh repository in the current directory.
shutil.rmtree(".git", ignore_errors=True)
subprocess.run(["git", "init"])

for i in range(5):
    # A small file whose content differs per commit; the callback rewrites these.
    with open("small", "wb") as out:
        out.write(bytes([i]))
        out.truncate(1024)

    # A large (650 MiB) file; the callback should leave these untouched.
    with open("large", "wb") as out:
        out.write(bytes([i]))
        out.truncate(650 * 1024 * 1024)

    subprocess.run(["git", "add", "small", "large"])
    subprocess.run(["git", "commit", "-m", "commit %d" % i])

# Rewrite only the 1024-byte blobs; the large blobs just pass through.
# (All rewritten small blobs become identical, which is presumably what
# triggers fast-import's duplicate-skipping truncation.)
subprocess.run([
    sys.argv[1],
    "--force",
    "--blob-callback", """
if len(blob.data) == 1024:
    blob.data = b"Test" + blob.data[4:]
"""])

@relgukxilef
Author

I tried my script in WSL and the error did not occur; I was then able to filter my repo in WSL without issues. Thanks for the suggestion.
Nonetheless, there appears to be a Windows-specific bug causing this. I don't know whether that is a concern.
