Crash when running --blob-callback on blobs larger than ~600,000,000 bytes #616

Open
relgukxilef opened this issue Dec 5, 2024 · 4 comments


@relgukxilef

Hello, I'm trying to convert certain files in my repository from one format to another. I wrote some Python code to do this and am passing it to git-filter-repo's --blob-callback argument.
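
For context, --blob-callback takes a Python body that git-filter-repo runs for every blob with blob in scope, and edits take effect by assigning to blob.data. A minimal stand-in for my callback (the transformation here is only an illustration, not the actual conversion):

# Body passed to --blob-callback; git-filter-repo supplies `blob`.
# Illustration only: upper-case small ASCII blobs instead of the real format conversion.
if len(blob.data) < 1024 and blob.data.isascii():
    blob.data = blob.data.upper()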

This seems to work for a few thousand commits, then fast-import crashes with the message "fatal: cannot truncate pack to skip duplicate: Invalid argument" and writes a fast_import_crash file.

I've tried this multiple times: with different callbacks, with the repository filtered to different paths, and with filtering by path first and then applying my blob callback in a second git-filter-repo run. The exact blob it stops at differs between runs, but it is always one much larger than the others, above 600,000,000 bytes. Perhaps this is a known or intentional limitation of git fast-import or git-filter-repo. The tail of the crash report's command history looks like this (the * marks the command fast-import was processing when it died):

  get-mark :655640
  blob
  mark :655641
  data 1507406
  blob
  mark :655642
  data 1558
  blob
  mark :655643
  data 865875
  blob
  mark :655644
* data 684724504

I have attached one such fast_import_crash file, but I have removed file and branch names, as this is a company repository.
fast_import_crash_30556.zip

@relgukxilef
Author

I have tried running git-filter-repo with --blob-callback return, and that finishes without issues. I have also tried returning early when either the input or the output of the conversion is larger than 1 MB (far smaller than the last blob listed in fast_import_crash), but it still crashes at a ~600 MB blob, even though the callback shouldn't do anything with it. Perhaps, having already updated some blobs, it fails to handle the large blob even when just passing it along? A simplified version of that size-gated callback is sketched below.
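
Roughly, the size-gated version looked like this (the conversion itself is elided; this sketch only shows the gating):

# Body passed to --blob-callback; `return` skips the blob entirely.
if len(blob.data) > 1024 * 1024:
    return  # pass large blobs through untouched
converted = blob.data  # stand-in for the real conversion
if len(converted) <= 1024 * 1024:
    blob.data = converted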

@newren
Owner

newren commented Dec 18, 2024

The error in your file is

fatal: cannot truncate pack to skip duplicate: Invalid argument

This message comes from the following git code:

static void truncate_pack(struct hashfile_checkpoint *checkpoint)
{
        if (hashfile_truncate(pack_file, checkpoint))
                die_errno("cannot truncate pack to skip duplicate");
        pack_size = checkpoint->offset;
}

and hashfile_truncate() in particular is this code:

int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
{
        off_t offset = checkpoint->offset;

        if (ftruncate(f->fd, offset) ||
            lseek(f->fd, offset, SEEK_SET) != offset)
                return -1;
        f->total = offset;
        the_hash_algo->clone_fn(&f->ctx, &checkpoint->ctx);
        f->offset = 0; /* hashflush() was called in checkpoint */
        return 0;
}

It's not clear whether it's ftruncate() or lseek() that is returning the error in your case, but the fact that you have files measuring 600 MB or more suggests you might be getting close to either a 2 GB or a 4 GB limit. If that's true, there's a possibility that switching platforms might help (due to differences in the size of long in C on different platforms). Or maybe there's some code somewhere that is using int/long/unsigned/unsigned long instead of off_t and size_t. But I don't have an easy way to reproduce.
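
As a back-of-the-envelope illustration of that guess (the amount already in the pack is an assumption, and compression is ignored): the offset handed to ftruncate() is a position within the whole pack file, so it can cross 2^31 - 1 even when no single blob comes close:

# Hypothetical numbers: cumulative pack offset vs. the signed 32-bit limit.
failing_blob = 684_724_504        # size of the blob from the crash report above
already_packed = 3 * 650 * 2**20  # assumption: bytes already written to the pack
offset = already_packed + failing_blob
print(offset > 2**31 - 1)         # True: the truncation offset is past 2 GB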

Could you report what platform you are on, and try a few other OSes? If that still fails, could you try to find a way to reproduce that others can duplicate?

@relgukxilef
Author

The error occurs on 64-bit Windows with git version 2.47.1.windows.1 and git-filter-repo version a40bce548d2c.
I have managed to reproduce the issue in a small repository. It occurs reliably when running the following script, which creates a repository in the working directory, makes a few commits with large files, and then runs git-filter-repo. Pass the path to git-filter-repo as the first command-line argument.

import subprocess, shutil, sys

# Start from a fresh repository in the current directory.
shutil.rmtree(".git", ignore_errors=True)
subprocess.run(["git", "init"])

for i in range(5):
    # A small file whose content differs per commit; the callback rewrites these.
    with open("small", "wb") as out:
        out.write(bytes([i]))
        out.truncate(1024)

    # A large (650 MiB) file; the callback should leave these untouched.
    with open("large", "wb") as out:
        out.write(bytes([i]))
        out.truncate(650 * 1024 * 1024)

    subprocess.run(["git", "add", "small", "large"])
    subprocess.run(["git", "commit", "-m", "commit %d" % i])

# Rewrite only the 1024-byte blobs; the large blobs just pass through.
# (All rewritten small blobs become identical, which is presumably what
# triggers fast-import's duplicate-skipping truncation.)
subprocess.run([
    sys.argv[1],
    "--force",
    "--blob-callback", """
if len(blob.data) == 1024:
    blob.data = b"Test" + blob.data[4:]
"""])

@relgukxilef
Author

I tried my script in WSL and the error did not occur; I was then able to filter my repo in WSL without issues. Thanks for the suggestion.
Nonetheless, there appears to be a Windows-specific bug causing this. I don't know whether that is a concern.
