Support open_clip with NPU backend #813
Conversation
Cool! How is the inference and training speed?
On Mon, Feb 5, 2024, 9:15 AM Mengqing Cao wrote:
open_clip performs great for CLIP model training and inference, but unfortunately it seems to support only GPU and CPU at the moment. I noticed that there is a need for other backends:
- TPU support #20
- More backends support #796

This PR adds Ascend NPU backend support. I tested the NPU support by evaluating the ViT-L-14 model on the ImageNet-1k dataset, and everything works well.
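(For context, and not part of this PR's diff: with Huawei's torch_npu plugin installed, device selection can typically be sketched as below; the fallback order and device index are assumptions.)

```python
import torch

try:
    # The torch_npu plugin registers the Ascend "npu" device type with PyTorch.
    import torch_npu  # noqa: F401
    npu_available = torch.npu.is_available()
except ImportError:
    npu_available = False

# Illustrative fallback order: NPU, then CUDA, then CPU.
if npu_available:
    device = torch.device("npu:0")
elif torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")
```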
Eval on NPU was run with:
python3 -m training.main \
--model ViT-L-14 \
--pretrained "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin" \
--seed 0 \
--imagenet-val './data/ImageNet-1000/val'
The pretrained weights were downloaded from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K).
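(A minimal loading sketch, assuming the checkpoint path above and a working torch_npu install; this is not code from the PR.)

```python
import torch
import torch_npu  # noqa: F401 -- enables the "npu" device type
import open_clip

# Local path to the downloaded checkpoint (assumed layout from the command above).
ckpt = "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin"

# create_model_and_transforms returns (model, train_preprocess, val_preprocess).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained=ckpt)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to("npu:0").eval()
```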
The evaluation results of ViT-L-14 on NPU:
- imagenet-zeroshot-val-top1: 78.89%
- imagenet-zeroshot-val-top5: 95.46%

[Screenshot of the evaluation run: https://github.com/mlfoundations/open_clip/assets/52243582/3df7fb0c-9928-4944-8a1a-e358240725b3]

The results are close to those on GPU (top-1 acc: 79.2%).
detailed training logs:
2024-02-05,08:00:10 | INFO | Running with a single process. Device npu:0.
2024-02-05,08:00:10 | INFO | Loaded ViT-L-14 model config.
2024-02-05,08:00:17 | INFO | Loading pretrained ViT-L-14 weights (./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin).
2024-02-05,08:00:21 | INFO | Model:
2024-02-05,08:00:21 | INFO | CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (ls_1): Identity()
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )
    )
  )
  (token_embedding): Embedding(49408, 768)
  (ln_final): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-02-05,08:00:21 | INFO | Params:
2024-02-05,08:00:21 | INFO | accum_freq: 1
2024-02-05,08:00:21 | INFO | aug_cfg: {}
2024-02-05,08:00:21 | INFO | batch_size: 64
2024-02-05,08:00:21 | INFO | beta1: 0.9
2024-02-05,08:00:21 | INFO | beta2: 0.98
2024-02-05,08:00:21 | INFO | checkpoint_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/checkpoints
2024-02-05,08:00:21 | INFO | coca_caption_loss_weight: 2.0
2024-02-05,08:00:21 | INFO | coca_contrastive_loss_weight: 1.0
2024-02-05,08:00:21 | INFO | copy_codebase: False
2024-02-05,08:00:21 | INFO | csv_caption_key: title
2024-02-05,08:00:21 | INFO | csv_img_key: filepath
2024-02-05,08:00:21 | INFO | csv_separator:
2024-02-05,08:00:21 | INFO | dataset_resampled: False
2024-02-05,08:00:21 | INFO | dataset_type: auto
2024-02-05,08:00:21 | INFO | ddp_static_graph: False
2024-02-05,08:00:21 | INFO | debug: False
2024-02-05,08:00:21 | INFO | delete_previous_checkpoint: False
2024-02-05,08:00:21 | INFO | device: npu:0
2024-02-05,08:00:21 | INFO | dist_backend: nccl
2024-02-05,08:00:21 | INFO | dist_url: env://
2024-02-05,08:00:21 | INFO | distill: False
2024-02-05,08:00:21 | INFO | distill_model: None
2024-02-05,08:00:21 | INFO | distill_pretrained: None
2024-02-05,08:00:21 | INFO | distributed: False
2024-02-05,08:00:21 | INFO | epochs: 32
2024-02-05,08:00:21 | INFO | epochs_cooldown: None
2024-02-05,08:00:21 | INFO | eps: 1e-06
2024-02-05,08:00:21 | INFO | force_custom_text: False
2024-02-05,08:00:21 | INFO | force_image_size: None
2024-02-05,08:00:21 | INFO | force_patch_dropout: None
2024-02-05,08:00:21 | INFO | force_quick_gelu: False
2024-02-05,08:00:21 | INFO | gather_with_grad: False
2024-02-05,08:00:21 | INFO | grad_checkpointing: False
2024-02-05,08:00:21 | INFO | grad_clip_norm: None
2024-02-05,08:00:21 | INFO | horovod: False
2024-02-05,08:00:21 | INFO | image_interpolation: None
2024-02-05,08:00:21 | INFO | image_mean: None
2024-02-05,08:00:21 | INFO | image_resize_mode: None
2024-02-05,08:00:21 | INFO | image_std: None
2024-02-05,08:00:21 | INFO | imagenet_v2: None
2024-02-05,08:00:21 | INFO | imagenet_val: ./data/ImageNet-1000/val
2024-02-05,08:00:21 | INFO | local_loss: False
2024-02-05,08:00:21 | INFO | local_rank: 0
2024-02-05,08:00:21 | INFO | lock_image: False
2024-02-05,08:00:21 | INFO | lock_image_freeze_bn_stats: False
2024-02-05,08:00:21 | INFO | lock_image_unlocked_groups: 0
2024-02-05,08:00:21 | INFO | lock_text: False
2024-02-05,08:00:21 | INFO | lock_text_freeze_layer_norm: False
2024-02-05,08:00:21 | INFO | lock_text_unlocked_layers: 0
2024-02-05,08:00:21 | INFO | log_every_n_steps: 100
2024-02-05,08:00:21 | INFO | log_level: 20
2024-02-05,08:00:21 | INFO | log_local: False
2024-02-05,08:00:21 | INFO | log_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/out.log
2024-02-05,08:00:21 | INFO | logs: ./logs/
2024-02-05,08:00:21 | INFO | lr: 0.0005
2024-02-05,08:00:21 | INFO | lr_cooldown_end: 0.0
2024-02-05,08:00:21 | INFO | lr_cooldown_power: 1.0
2024-02-05,08:00:21 | INFO | lr_scheduler: cosine
2024-02-05,08:00:21 | INFO | model: ViT-L-14
2024-02-05,08:00:21 | INFO | name: 2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp
2024-02-05,08:00:21 | INFO | no_set_device_rank: False
2024-02-05,08:00:21 | INFO | precision: amp
2024-02-05,08:00:21 | INFO | pretrained: ./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
2024-02-05,08:00:21 | INFO | pretrained_image: False
2024-02-05,08:00:21 | INFO | rank: 0
2024-02-05,08:00:21 | INFO | remote_sync: None
2024-02-05,08:00:21 | INFO | remote_sync_frequency: 300
2024-02-05,08:00:21 | INFO | remote_sync_protocol: s3
2024-02-05,08:00:21 | INFO | report_to:
2024-02-05,08:00:21 | INFO | resume: None
2024-02-05,08:00:21 | INFO | save_frequency: 1
2024-02-05,08:00:21 | INFO | save_most_recent: False
2024-02-05,08:00:21 | INFO | seed: 0
2024-02-05,08:00:21 | INFO | siglip: False
2024-02-05,08:00:21 | INFO | skip_scheduler: False
2024-02-05,08:00:21 | INFO | tensorboard: False
2024-02-05,08:00:21 | INFO | tensorboard_path:
2024-02-05,08:00:21 | INFO | torchcompile: False
2024-02-05,08:00:21 | INFO | torchscript: False
2024-02-05,08:00:21 | INFO | trace: False
2024-02-05,08:00:21 | INFO | train_data: None
2024-02-05,08:00:21 | INFO | train_data_upsampling_factors: None
2024-02-05,08:00:21 | INFO | train_num_samples: None
2024-02-05,08:00:21 | INFO | use_bn_sync: False
2024-02-05,08:00:21 | INFO | use_bnb_linear: None
2024-02-05,08:00:21 | INFO | val_data: None
2024-02-05,08:00:21 | INFO | val_frequency: 1
2024-02-05,08:00:21 | INFO | val_num_samples: None
2024-02-05,08:00:21 | INFO | wandb: False
2024-02-05,08:00:21 | INFO | wandb_notes:
2024-02-05,08:00:21 | INFO | wandb_project_name: open-clip
2024-02-05,08:00:21 | INFO | warmup: 10000
2024-02-05,08:00:21 | INFO | wd: 0.2
2024-02-05,08:00:21 | INFO | workers: 4
2024-02-05,08:00:21 | INFO | world_size: 1
2024-02-05,08:00:21 | INFO | zeroshot_frequency: 2
2024-02-05,08:00:21 | INFO | Starting zero-shot imagenet.
2024-02-05,08:00:21 | INFO | Building zero-shot classifier
2024-02-05,08:01:13 | INFO | Using classifier
2024-02-05,08:02:09 | INFO | Finished zero-shot imagenet.
2024-02-05,08:02:09 | INFO | Eval Epoch: 0 imagenet-zeroshot-val-top1: 0.7889 imagenet-zeroshot-val-top5: 0.9546
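(The "Building zero-shot classifier" / "Using classifier" steps above correspond roughly to the following sketch; the prompt template, class names, and image path are assumptions, and the real pipeline averages many prompt templates over all 1,000 classes.)

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Build a (very reduced) zero-shot classifier from class-name prompts.
classnames = ["tench", "goldfish", "great white shark"]  # illustrative subset of ImageNet classes
prompts = [f"a photo of a {name}" for name in classnames]

with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical input image
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Zero-shot prediction: cosine similarity against the class text embeddings.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```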
Commit Summary
- a6a2032: add npu support
File Changes (5 files, https://github.com/mlfoundations/open_clip/pull/813/files)
- A requirements-npu.txt (7)
- M src/training/distributed.py (9)
- M src/training/main.py (10)
- M src/training/precision.py (5)
- M src/training/profiler.py (7)
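(For readers who have not opened the diff: the device plumbing in the training entry point presumably follows a pattern like the sketch below; the function and argument names are illustrative, not the actual patch.)

```python
import torch

def init_device(args):
    # Illustrative only: pick the accelerator this process should use.
    if torch.cuda.is_available():
        device = f"cuda:{args.local_rank}"
        torch.cuda.set_device(device)
    elif hasattr(torch, "npu") and torch.npu.is_available():
        # torch.npu is available once the torch_npu plugin has been imported.
        device = f"npu:{args.local_rank}"
        torch.npu.set_device(device)
    else:
        device = "cpu"
    args.device = device
    return torch.device(device)
```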
A metric we usually look at is samples/s per accelerator.
Some baselines, on one 3080 GPU:
- B/32 inference speed is about 1300 samples/s
- L/14 is about 300 samples/s

Usually increasing the batch size to values like 256 helps.
For training on one A100 it looks like:
- 250 samples/s for B/32 (can be more if using fewer accelerators, hence less interconnect bottleneck)
- 80 samples/s for L/14

Usually with batch sizes around 128 per GPU.
I think it would be very interesting to have similar numbers on NPU.
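(A minimal sketch for collecting comparable samples/s numbers on NPU; the batch size, iteration counts, and the use of the "openai" pretrained tag are assumptions, not part of this PR.)

```python
import time
import torch
import torch_npu  # noqa: F401
import open_clip

device = "npu:0"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model = model.to(device).eval()

batch_size, n_iters = 64, 50
images = torch.randn(batch_size, 3, 224, 224, device=device)

with torch.no_grad():
    for _ in range(5):           # warm-up iterations
        model.encode_image(images)
    torch.npu.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model.encode_image(images)
    torch.npu.synchronize()

print(f"{batch_size * n_iters / (time.time() - start):.1f} samples/s")
```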
On Mon, Feb 5, 2024, 1:30 PM Mengqing Cao wrote:
Cool! How is the inference and training speed?
Your speed of reply is amazing! :)
As the following picture shows, it takes around 55 s to run ViT-L-14 inference on the ImageNet-1k validation dataset (with batch size 64 and one NPU device).

[Screenshot of the timing: https://github.com/mlfoundations/open_clip/assets/52243582/30356825-6496-4c4d-af64-79d4b31890d6]

So I think it's fast, but I haven't measured the exact FLOPS. Is the FLOPS figure required?
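(For rough comparison with the samples/s baselines above: if the 55 s covers the full 50,000-image ImageNet-1k validation split, that works out to roughly 50,000 / 55 ≈ 900 samples/s for L/14 inference, though the exact figure depends on batch size and on whether zero-shot classifier construction is included in the timing.)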
Sorry for the late reply, and thanks for your explanation. I've noticed that a code implementation of this metric already exists in the training pipeline, and it is named
I'm a bit confused whether the Screenshots
Force-pushed from 6cc2de4 to cd48368.
@rom1504 Hi, weeks have gone by; if there are any suggestions or concerns, please let me know and I'll address them as soon as possible.
Could anyone help review this? Thanks 👍 @rom1504 @rwightman @gabrielilharco @bryant1410 @mitchellnw
Sorry to bother you. Could you help review this PR? @rwightman @gabrielilharco
Force-pushed from a5e22fb to b92a266.
@MengqingCao similar to the timm comments, if
Thanks for your review, the latest commit has fixed it.
This was merged through #965 but was not auto-closed for some reason.