Skip to content
This repository has been archived by the owner on Aug 19, 2024. It is now read-only.
/ superblock Public archive

A block oriented training approach for inference time optimization.

License

Notifications You must be signed in to change notification settings

pytorch-labs/superblock

Repository files navigation

SuperBlock has now been transferred to torchao repo here.

SuperBlock

SuperBlock combines two techniques for efficient neural network training and inference: Supermask and Block Compressed Sparse Row (BSR). The techniques are described in this blog post.

Supermask

Supermask is a technique for applying structured sparsity to neural networks using a learned mask. It works by learning a continuous mask (scores) that is applied element-wise to the weights of a neural network layer. The mask scores are learned separately from the weights and are thresholded based on a target sparsity level to obtain a binary mask. The mask determines which weigths are kept and which are pruned, and is learned during training.

During inference, the binary mask is applied element-wise to the weights, pruning the weights that correspond to a 0 in the mask, resulting in a sparse network that can be efficiently computed.

Block compressed Sparse Row Format (BSR)

The BSR format is a sparse matrix representation that stores dense sub-blocks of non-zero elements instead of individual non-zero elements. The matrix is divided into equal-sized blocks, and only the non-zero blocks are stored.

The BSR format is efficient for sparse matrices with a block structure, where non-zero elements tend to cluster in dense sub-blocks. It reduces storage requirements and enables efficient matrix operations on the non-zero blocks.

Currently, the BSR format is optimized for Nvidia A100 GPU(s) only.

Setup

To use SuperBlock, you will need

To train the model or evaluate accuracy, you will need:

  • ImageNet2012-blurred dataset

At least one GPU:

  • A100 or H100

Installation

  • Clone this repo
    git clone https://github.com/pytorch-labs/superblock.git
    cd superblock
    
  • Create a new conda environment
    conda create -n superblock
    conda activate superblock
    
  • Install PyTorch. For best performance, we recommend 2.3.0.dev20240305+cu121 nightly
    pip install --pre torch==2.3.0.dev20240305+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
    pip install --pre torchvision==0.18.0 --no-deps
    

Benchmarking

Baseline:

python benchmark.py \
  --model vit_b_16 \
  --batch-size 256 \
  > /dev/null

Result:

532.1160546875 ms

80% sparsity, block size 64 (random weights):

python benchmark.py --model vit_b_16 \
  --batch-size 256 \
  --sparsity-linear 0.8 \
  --sp-linear-tile-size 64 \
  --sparsify-weights \
  --bsr 64 \
  > /dev/null

Result:

393.864453125 ms

Training

Please refer to TRAINING.md for training from scratch. We use Torchvision as our framework for training. Supermask can be applied during training.

To apply supermask, we have the following arguments at our disposal,

  • Apply Supermask to linear layers:
    --sparsity-linear
    --sp-linear-tile-size
    
  • Apply Supermask to conv1x1 layers:
    --sparsity-conv1x1
    --sp-conv1x1-tile-size
    
  • Apply Supermask to all other convolutional layers:
    --sparsity-conv
    --sp-conv-tile-size
    
  • Skip the first transformer layer and/or last linear layer (ViT only):
    --skip-last-layer-sparsity
    --skip-first-transformer-sparsity
    

For example, if you would like to train a vit_b_16 from scratch using Supermask, you can use the respective torchvision command found in TRAINING.md and append the supermask arguments:

torchrun --nproc_per_node=8 train.py\
    --model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema\ 
    --sparsity-linear 0.9 --sp-linear-tile-size 32

Through this command, we are training a vit_b_16 with 90% sparsity to linear layers using 32x32 tiles.

Please run python train.py --help for a full list of available arguments.

Evaluation

To run an evaluation of a Supermask-trained model, you can use evaluate.py. Our current version has signficant speedup with float32 only and not float16, hence, to illustrate speedup, we don't pass --amp in the example commands below.

MODEL_PATH=<put the path of the trained checkpoint here>
IMAGENET_PATH=<put the path of ImageNet dataset here>
NGPUS=1 # put number of available GPUS here
  • Offline sparsification with BSR:

    torchrun --nproc_per_node=${NGPUS} evaluate.py  --model vit_b_16 --batch-size 256 --sparsity-linear 0.9 --sp-linear-tile-size 32 --weights-path ${MODEL_PATH}  --data-path ${IMAGENET_PATH} --sparsify-weights --bsr 32
    

    This command applies 90% sparsity to linear layers using 32x32 tiles, loads the model weights from ${MODEL_PATH}, loads the ImageNet validation set located at the specified path, applies offline sparsification to the weights, and converts the sparse weights to BSR format with a block size of 32. It is recommended to set --bsr the same as tile size.

  • Online sparsification without BSR:

    torchrun --nproc_per_node=${NGPUS} evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear 0.9 --sp-linear-tile-size 32 --weights-path ${MODEL_PATH} --data-path ${IMAGENET_PATH}
    

    This is similar to the previous command, but it does not apply offline sparsification or BSR conversion. Instead, the sparsity is applied on-the-fly during evaluation.

Please run python evaluate.py --help for a full list of available arguments.

Results (1x A100):

  • Baseline

    Test:  Total time: 0:02:11
    Test:  Acc@1 78.392 Acc@5 93.592
    
  • Sparsity= 0.9, Tile Size = 32, Online Sparsification, BSR = None

    Test:  Total time: 0:01:52
    Test:  Acc@1 76.092 Acc@5 92.656
    
  • Sparsity= 0.9, Tile Size = 32, Offline Sparsification, BSR = None

    Test:  Total time: 0:01:54
    Test:  Acc@1 76.092 Acc@5 92.656
    
  • Sparsity= 0.9, Tile Size = 32, Offline Sparsification, BSR = 32

    Test:  Total time: 0:01:25
    Test:  Acc@1 76.092 Acc@5 92.656
    

Pretrained Weights

Download:

Instead of training from scratch, if you'd like to use the Supermask weights of vit_b_16 trained on privacy mitigated Imagenet-blurred, you can download them here:

SPARSITY=0.80 # Checkpoints available for: 0.70, 0.80, 0.82, 0.84, 0.86, 0.88, 0.90
BLOCK_SIZE=32 # Checkpoints available for: 16, 32, 64
mkdir checkpoints
# For baseline,
wget https://huggingface.co/facebook/superblock-vit-b-16/resolve/main/checkpoints/baseline.pth -P checkpoints/
# For sparsified checkpoints,
wget https://huggingface.co/facebook/superblock-vit-b-16/resolve/main/checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth -P checkpoints/

Benchmark:

python benchmark.py --model vit_b_16 \
  --batch-size 256 \
  --sparsity-linear ${SPARSITY} \
  --sp-linear-tile-size ${BLOCK_SIZE} \
  --sparsify-weights \
  --bsr ${BLOCK_SIZE} \
  --weights-path ./checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth \
  > /dev/null

Result:

530.342578125 ms

Evaluate:

8 x A100 GPUs:

torchrun --nproc_per_node=8 evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear ${SPARSITY} --sp-linear-tile-size ${BLOCK_SIZE} --bsr ${BLOCK_SIZE} --sparsify-weights --weights-path checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth --data-path ${IMAGENET_PATH}

Result:

Test:  Total time: 0:01:01
Test:  Acc@1 77.644 Acc@5 93.554

1 x A100 GPUs:

torchrun --nproc_per_node=1 evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear ${SPARSITY} --sp-linear-tile-size ${BLOCK_SIZE} --bsr ${BLOCK_SIZE} --sparsify-weights --weights-path checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth --data-path ${IMAGENET_PATH}

Result:

Test:  Total time: 0:01:51
Test:  Acc@1 77.644 Acc@5 93.554

License

SuperBlock is released under the MIT license.