Remove big binary files #622

Open
rcannood opened this issue Nov 24, 2023 · 2 comments

@rcannood

If we use BFG to remove all blobs larger than 1M, we can reduce the openpipeline repo from about 200 MiB to around 45 MiB (compare the clone and push sizes in the log below). We can probably reduce it further by lowering the threshold. @DriesSchaumont WDYT?

$ git clone --mirror git@github.com:openpipelines-bio/openpipeline.git lfs_test.git
Cloning into bare repository 'lfs_test.git'...
remote: Enumerating objects: 397073, done.
remote: Counting objects: 100% (6019/6019), done.
remote: Compressing objects: 100% (2307/2307), done.
remote: Total 397073 (delta 3644), reused 5873 (delta 3512), pack-reused 391054
Receiving objects: 100% (397073/397073), 200.99 MiB | 5.97 MiB/s, done.
Resolving deltas: 100% (269042/269042), done.

$ java -jar ~/Downloads/bfg-1.14.0.jar --strip-blobs-bigger-than 1M lfs_test.git

Using repo : /home/rcannood/workspace/openpipelines-bio/lfs_test.git


This repo has been processed by The BFG before! Will prune repo before proceeding - to avoid unnecessary cleaning work on unused objects...
Completed prune of old objects - will now proceed with the main job!

Scanning packfile for large blobs: 1588292
Scanning packfile for large blobs completed in 6,443 ms.
Found 6 blob ids for large blobs - biggest=14395908 smallest=1521437
Total size (unpacked)=47515450
Found 443 objects to protect
Found 512 commit-pointing refs : HEAD, refs/heads/481-add-leiden-clustering-to-scvi-pipeline, refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references, ...
Found 4 tag-pointing refs : refs/tags/0.3.0, refs/tags/0.3.1, refs/tags/0.4.0, refs/tags/0.4.1

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 5fb2a9e0 (protected by 'HEAD')

Cleaning
--------

Found 4459 commits
Cleaning commits:       100% (4459/4459)
Cleaning commits completed in 3,003 ms.

Updating 156 Refs
-----------------

	Ref                                                                          Before     After   
	------------------------------------------------------------------------------------------------
	refs/heads/481-add-leiden-clustering-to-scvi-pipeline                      | 700bffd6 | 6d0b9eec
	refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references | 772769ee | 7abac021
	refs/heads/604-use-the-viash-dependencies-config-value-for-workflows       | 843009e8 | 8b7b78ba
	refs/heads/concat_dtypes                                                   | c8f1e5f8 | e92cbea4
	refs/heads/feature/ataq-demux                                              | 5dcebba7 | 1666af0f
	refs/heads/feature/ataq-qc                                                 | dde357ff | 98d64cbd
	refs/heads/feature/scpoli_implementation                                   | b17c3a84 | 3ee6bc23
	refs/heads/increase_ci_memory                                              | 1464e7aa | 9b6af876
	refs/heads/integration_build                                               | b225d951 | d1eaab7b
	refs/heads/main                                                            | 5fb2a9e0 | 56ac0431
	refs/heads/main_build                                                      | 8a9894a6 | cc0001cd
	refs/heads/main_build_datasets_schema                                      | 5022c403 | 901839ca
	refs/heads/more_memory_tests                                               | fe5188fa | 7608da95
	refs/heads/release                                                         | 98678513 | 0594ac36
	refs/heads/review_cellxgene                                                | f881710c | 475cecfc
	...

Updating references:    100% (156/156)
...Ref update completed in 38 ms.

Commit Tree-Dirt History
------------------------

	Earliest                                              Latest
	|                                                          |
	..............................................DDDDDDDDDmmDmm

	D = dirty commits (file tree fixed)
	m = modified commits (commit message or parents changed)
	. = clean commits (no changes to file tree)

	                        Before     After   
	-------------------------------------------
	First modified commit | 6455c1d6 | fae7b4ab
	Last dirty commit     | e27f9172 | 3ffb155c

Deleted files
-------------

	Filename                                                                  Git id            
	--------------------------------------------------------------------------------------------
	cellranger-tiny-bcl-1.2.0.tar.gz                                        | 4b3e7995 (13.4 MB)
	cl-base.obo                                                             | af96cc47 (1.5 MB) 
	matrix.mtx.gz                                                           | 9e469be2 (4.0 MB) 
	pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5                        | eade8772 (5.2 MB) 
	pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5ad                      | 145b611c (13.7 MB)
	pbmc_1k_protein_v3_filtered_feature_bc_matrix.norm.hvg.pca.nn.umap.h5ad | de2901dd (7.6 MB) 


In total, 22327 object ids were changed. Full details are logged here:

	/home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-40-05

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

$ cd lfs_test.git

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 397073, done.
Counting objects: 100% (397073/397073), done.
Delta compression using up to 32 threads
Compressing objects: 100% (379869/379869), done.
Writing objects: 100% (397073/397073), done.
Selecting bitmap commits: 4368, done.
Building bitmaps: 100% (148/148), done.
Total 397073 (delta 268875), reused 124073 (delta 0), pack-reused 0

$ git push
Enumerating objects: 397073, done.
Writing objects: 100% (397073/397073), 44.70 MiB | 3.69 MiB/s, done.
Total 397073 (delta 0), reused 0 (delta 0), pack-reused 397073
remote: Resolving deltas: 100% (268875/268875), done.
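
Before choosing a threshold, it may help to see which blobs in the history are actually large. The following is a minimal sketch, not something that was run in this thread; `filter_large_blobs` and `list_large_blobs` are hypothetical helper names, and the threshold is given in bytes.

```shell
# Sketch: list blobs anywhere in the history that exceed a size threshold.
# Both function names are made up for illustration; they are not part of
# git or BFG.

# Reads "<type> <sha> <size> <path>" lines on stdin and prints every blob
# larger than $1 bytes as "<size> <sha> <path>", largest first.
filter_large_blobs() {
  awk -v t="$1" '$1 == "blob" && $3 > t {
    size = $3; sha = $2
    $1 = $2 = $3 = ""          # drop the first three fields, keep the path
    sub(/^ +/, "")             # trim the separators left behind
    print size, sha, $0
  }' | sort -rn
}

# Walks every object reachable from any ref and pipes it through the filter.
# Must be run inside a git repository (a bare mirror clone works too).
list_large_blobs() {
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | filter_large_blobs "$1"
}
```

Inside `lfs_test.git`, `list_large_blobs 1048576` would list candidates for a 1M threshold, assuming BFG's `1M` means 1048576 bytes. Paths containing runs of several spaces are collapsed to single spaces by this sketch.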

@rcannood

If I set the threshold to 200K, I get:

$ java -jar ~/Downloads/bfg-1.14.0.jar --strip-blobs-bigger-than 200K lfs_test.git
Using repo : /home/rcannood/workspace/openpipelines-bio/lfs_test.git


This repo has been processed by The BFG before! Will prune repo before proceeding - to avoid unnecessary cleaning work on unused objects...
Completed prune of old objects - will now proceed with the main job!

Scanning packfile for large blobs: 794146
Scanning packfile for large blobs completed in 2,581 ms.
Found 2891 blob ids for large blobs - biggest=715168 smallest=216802
Total size (unpacked)=53673113
Found 443 objects to protect
Found 512 commit-pointing refs : HEAD, refs/heads/481-add-leiden-clustering-to-scvi-pipeline, refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references, ...
Found 4 tag-pointing refs : refs/tags/0.3.0, refs/tags/0.3.1, refs/tags/0.4.0, refs/tags/0.4.1

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 56ac0431 (protected by 'HEAD') - contains 3 dirty files : 
    - images/concepts/fig.svg (389.1 KB)
    - src/mapping/bd_rhapsody/rhapsody_targeted_1.10.1_nodocker.cwl (211.7 KB)
    - src/mapping/bd_rhapsody/rhapsody_wta_1.10.1_nodocker.cwl (212.8 KB)

WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.

Details of protected dirty content have been recorded here :

/home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-49-03/protected-dirt/

If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.
       

Cleaning
--------

Found 4459 commits
Cleaning commits:       100% (4459/4459)
Cleaning commits completed in 2,481 ms.

Updating 514 Refs
-----------------

    Ref                                                                          Before     After   
    ------------------------------------------------------------------------------------------------
    refs/heads/481-add-leiden-clustering-to-scvi-pipeline                      | 6d0b9eec | eb966355
    refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references | 7abac021 | 56f76331
    refs/heads/604-use-the-viash-dependencies-config-value-for-workflows       | 8b7b78ba | 75caa7e9
    refs/heads/automation                                                      | 9cd06207 | b857a87a
    refs/heads/concat_dtypes                                                   | e92cbea4 | 21942a5e
    refs/heads/feature/ataq-demux                                              | 1666af0f | 3792b762
    refs/heads/feature/ataq-qc                                                 | 98d64cbd | de89e1b7
    refs/heads/feature/cellranger_convert                                      | 951b5c99 | e43c8791
    refs/heads/feature/count_demultiplexing                                    | 6461edd3 | d5e1bd2f
    refs/heads/feature/refactor_velocyto                                       | 068ed30d | 76440a30
    refs/heads/feature/scpoli_implementation                                   | 3ee6bc23 | 7a4cbf9c
    refs/heads/feature/ts                                                      | bfd45792 | ddb86b6d
    refs/heads/fix_temp_var                                                    | b52db6ef | 9fb67514
    refs/heads/increase_ci_memory                                              | 9b6af876 | 299ad45f
    refs/heads/integration_build                                               | d1eaab7b | 0373d7e3
    ...

Updating references:    100% (514/514)
...Ref update completed in 117 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    .DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After   
    -------------------------------------------
    First modified commit | a2af1a87 | 89e209f5
    Last dirty commit     | e1ddf9cd | b942e054

Deleted files
-------------

    Filename                                        Git id                                       
    ---------------------------------------------------------------------------------------------
    CS0000007_subsample_LI00080.csv.gz            | 449a7f3a (343.0 KB)                          
    features.tsv.gz                               | 1288f445 (297.6 KB)                          
    fig.svg                                       | 2a72f8e7 (389.1 KB)                          
    main.nf                                       | cfd3ebb5 (213.1 KB), 732e783e (252.9 KB), ...
    multi_star                                    | 76d7c752 (337.8 KB), b87f789d (335.4 KB), ...
    pbmc_1k_protein_v3_raw_feature_bc_matrix.h5   | 0d3a7789 (274.6 KB)                          
    pbmc_1k_protein_v3_raw_feature_bc_matrix.h5ad | 62aa4349 (698.4 KB)                          
    pipelines-target-p1.png                       | 1f658205 (292.0 KB), 5dc0174c (292.0 KB)     
    pipelines-target-p2.png                       | d9a7235a (300.7 KB), 55690133 (300.7 KB)     
    pipelines-target-p3.png                       | ec2cf53b (250.2 KB), ac65760d (245.8 KB), ...
    pipelines.svg                                 | 19ee6521 (278.9 KB), 16d12ddb (289.1 KB)     
    rhapsody_targeted_1.10.1_nodocker.cwl         | 56a6310b (211.7 KB)                          
    rhapsody_wta_1.10.1_nodocker.cwl              | 5fa9ea85 (212.8 KB)                          
    rhapsody_wta_1.10_nodocker.cwl                | c941c763 (212.3 KB)                          
    star_align                                    | 0df72d36 (308.0 KB), 4a1e589e (307.9 KB), ...
    star_align_v273a                              | e9182424 (308.3 KB), 39258580 (308.4 KB), ...


In total, 18874 object ids were changed. Full details are logged here:

    /home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-49-03
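
To compare the two runs, the blob counts and unpacked totals that BFG prints can be pulled out of the logs. This is a rough sketch; `bfg_summary` is a hypothetical helper name, and it assumes the log lines look exactly like the ones above.

```shell
# Sketch: summarise a BFG log read from stdin as the number of large blob ids
# and their total unpacked size in MiB. bfg_summary is a made-up helper name.
bfg_summary() {
  awk '
    /^Found [0-9]+ blob ids for large blobs/ { blobs = $2 }
    /^Total size \(unpacked\)=/ { split($0, kv, "="); bytes = kv[2] }
    END { printf "%d blobs, %.1f MiB unpacked\n", blobs, bytes / 1048576 }
  '
}

# Lines copied from the 1M run and the 200K run above.
printf '%s\n' 'Found 6 blob ids for large blobs - biggest=14395908 smallest=1521437' \
              'Total size (unpacked)=47515450' | bfg_summary
printf '%s\n' 'Found 2891 blob ids for large blobs - biggest=715168 smallest=216802' \
              'Total size (unpacked)=53673113' | bfg_summary
```

This prints `6 blobs, 45.3 MiB unpacked` and `2891 blobs, 51.2 MiB unpacked`. Since the 200K run was made on the repo already cleaned at 1M, the second figure is on top of the first. These are unpacked sizes, so the saving in the packed repo will likely be smaller.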

@DriesSchaumont

Does this edit the repository retroactively? If so, we should make sure to exclude release, main, main_build and all tags. Otherwise we could break older releases/runs. Is there a problem with having a large repo?
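
One way to check the retroactivity question against the logs above is to look at the "Before"/"After" columns of BFG's ref table for the branches mentioned. A minimal sketch, with `rewritten_refs` as a hypothetical helper name:

```shell
# Sketch: report which refs a BFG run rewrote, given rows of its
# "Updating ... Refs" table on stdin ("<ref> | <before> | <after>").
# rewritten_refs is a made-up helper name, not part of BFG or git.
rewritten_refs() {
  awk -F'|' 'NF == 3 && $1 ~ /refs\// {
    ref = $1; before = $2; after = $3
    gsub(/[ \t]/, "", ref); gsub(/[ \t]/, "", before); gsub(/[ \t]/, "", after)
    if (before != after) print ref, before, "->", after
  }'
}

# Rows copied from the 1M run above for the branches named in the question.
rewritten_refs <<'EOF'
	refs/heads/main                                                            | 5fb2a9e0 | 56ac0431
	refs/heads/main_build                                                      | 8a9894a6 | cc0001cd
	refs/heads/release                                                         | 98678513 | 0594ac36
EOF
```

All three rows are printed, so in that run `main`, `main_build` and `release` each ended up pointing at a new commit id. `main` changed even though its tip commit 5fb2a9e0 was listed as protected by `HEAD`, which suggests protection covers file contents rather than commit ids. The tags are not shown in the truncated table, so the logs do not say whether they moved.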
