Add Support for S^2 #1376
Conversation
Thank you for the contribution, and congrats on
```python
self.s2_split_size = self.s2_scales[0]
self.s2_image_size = self.s2_scales[-1]

# change resize/crop size in preprocessing to the largest image size in s2_scale
```
Maybe we can move the import here:
`from s2wrapper import forward as multiscale_forward`
and set `self.multiscale_forward = multiscale_forward`.
And maybe you can add exception handling on `ImportError` to prompt the user to install `s2wrapper`.
Hi Haotian, thanks for the suggestions! I've made the changes accordingly. Please take a look when you get a chance. Thanks!
Thanks!
Shouldn't the model arguments also be changed to include `s2` and `s2_scales`?
Hi @lightingvector, yes, that's correct. Currently this PR only supports running and evaluating an already pre-trained LLaVA with S2. If you want to train LLaVA with S2, you need to add the `s2` and `s2_scales` arguments to the model config yourself. Thanks for pointing it out!
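For reference, a minimal sketch of what such an addition could look like in a LLaVA-style `ModelArguments` dataclass. The file location and the surrounding fields are assumptions here, not part of this PR as quoted; only the two S2 configs themselves come from the discussion.

```python
# Hypothetical sketch: exposing the two S2 configs as training-time model arguments.
# The dataclass name and existing fields are assumed to mirror llava/train/train.py.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
    vision_tower: Optional[str] = field(default=None)
    # ... other existing fields ...
    s2: bool = field(default=False)                 # turn on S2 multi-scale features
    s2_scales: str = field(default="336,672,1008")  # comma-separated image scales
```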
Thanks for the great work, @bfshi. I've got a few questions. I'm also curious about the effect of S2 on LLaVA on benchmarks other than V*; if you have any results or trained weights, could you share some?
Hi @diridiri, thanks for the interest! Yes, you need to train a mm_projector with the new hidden size. For results of S2 on other benchmarks, please refer to Table 11 in Appendix D of the paper. We are planning to release the checkpoint for LLaVA1.5 with S2 soon. For LLaVA1.6, since the training code is not released yet, we don't have the checkpoints currently.
Appreciate your guidance! I have one more question about your implementation :), since I encountered an error when training with S2. In the current implementation, the code below calls the super-class constructor first:

```python
class CLIPVisionTowerS2(CLIPVisionTower):
    def __init__(self, vision_tower, args, delay_load=False):
        super().__init__(vision_tower, args, delay_load)

        self.s2_scales = getattr(args, 's2_scales', '336,672,1008')
        self.s2_scales = list(map(int, self.s2_scales.split(',')))
        self.s2_scales.sort()
        self.s2_split_size = self.s2_scales[0]
        self.s2_image_size = self.s2_scales[-1]

        try:
            from s2wrapper import forward as multiscale_forward
        except ImportError:
            raise ImportError('Package s2wrapper not found! Please install by running: \npip install git+https://github.com/bfshi/scaling_on_scales.git')
        self.multiscale_forward = multiscale_forward

        # change resize/crop size in preprocessing to the largest image size in s2_scale
        if not delay_load or getattr(args, 'unfreeze_mm_vision_tower', False):
            self.image_processor.size['shortest_edge'] = self.s2_image_size
            self.image_processor.crop_size['height'] = self.image_processor.crop_size['width'] = self.s2_image_size
```

Then the constructor of `CLIPVisionTower` calls `load_model()`:

```python
class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower, args, delay_load=False):
        super().__init__()

        self.is_loaded = False

        self.vision_tower_name = vision_tower
        self.select_layer = args.mm_vision_select_layer
        self.select_feature = getattr(args, 'mm_vision_select_feature', 'patch')

        if not delay_load:
            self.load_model()
        elif getattr(args, 'unfreeze_mm_vision_tower', False):
            self.load_model()
        else:
            self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
```

And here I encounter the error, because `load_model()` uses `self.s2_image_size` before it has been set:

```python
    def load_model(self, device_map=None):
        if self.is_loaded:
            print('{} is already loaded, `load_model` called again, skipping.'.format(self.vision_tower_name))
            return

        self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
        self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
        self.vision_tower.requires_grad_(False)

        self.image_processor.size['shortest_edge'] = self.s2_image_size
        self.image_processor.crop_size['height'] = self.image_processor.crop_size['width'] = self.s2_image_size

        self.is_loaded = True
```

Switching the order of initialization of the attributes (setting the S2-related attributes before calling `super().__init__()`) fixes it for me.
Thanks for pointing this out! Yeah, probably the simplest fix is defining the S2-specific attributes before calling `super().__init__()`.
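A minimal sketch of that fix, based on the constructor quoted above (only the ordering changes; the rest of the body stays the same):

```python
class CLIPVisionTowerS2(CLIPVisionTower):
    def __init__(self, vision_tower, args, delay_load=False):
        # Parse the S2 configs before calling the parent constructor, so that
        # self.s2_image_size already exists when the parent __init__ triggers
        # load_model() (e.g. when unfreeze_mm_vision_tower=True during training).
        self.s2_scales = getattr(args, 's2_scales', '336,672,1008')
        self.s2_scales = list(map(int, self.s2_scales.split(',')))
        self.s2_scales.sort()
        self.s2_split_size = self.s2_scales[0]
        self.s2_image_size = self.s2_scales[-1]

        super().__init__(vision_tower, args, delay_load)

        # ... s2wrapper import and image_processor resizing as before ...
```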
I also found this minor bug, and I submitted this PR for a quick fix.
Hello, is this meant to support quantization? I hit an error when loading a model with quantization enabled. (Full stack trace omitted.)
Hi @VldmrB, are you evaluating with `CLIPVisionTowerS2`? Did you train a model with S2 or did you use the LLaVA checkpoint trained without S2?
Hello, I just realized that you already mentioned it will only work with a model trained with S2. My apologies, I misread it earlier. Thanks for doing this, in any case!
Hi @bfshi, thanks for your great work. Has the checkpoint for LLaVA1.5 with S2 been released yet?
Hi @yanbai1993, yes! Please see here in the S2 repo.
What does this PR do?
This PR integrates S2 into LLaVA-NeXT.
What is S2?
S2 is a method to extract multi-scale features from an image. For example, given a 336x336 image, S2 interpolates it to multiple scales such as 336x336, 672x672, and 1008x1008, extracts features at each scale, and merges them into a multi-scale feature map. The multi-scale features contain more detailed information about the image, which is beneficial for multimodal LLMs. Meanwhile, S2 keeps the number of visual tokens sent to the LLM the same as with regular single-scale features, so no extra computational overhead is incurred on the LLM.
Please find more details in the S2 paper and GitHub repo.
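To make the idea concrete, here is an illustrative PyTorch sketch of the multi-scale extraction described above. It is a toy rendition of the idea, not the `s2wrapper` implementation; `backbone` is assumed to return `(batch, num_patches, dim)` patch tokens, as a CLIP vision tower does after feature selection.

```python
import torch
import torch.nn.functional as F


def toy_multiscale_features(backbone, images, scales=(336, 672, 1008)):
    """Toy sketch of the S2 idea: run the same backbone on several image scales,
    pool each scale's feature map back to the base grid, and concatenate along
    the channel dimension, so the visual token count stays the same."""
    base = scales[0]
    all_feats = []
    for s in scales:
        x = F.interpolate(images, size=(s, s), mode='bicubic', align_corners=False)
        n = s // base                                 # base-size crops per side
        rows = []
        for i in range(n):
            cols = []
            for j in range(n):
                crop = x[:, :, i * base:(i + 1) * base, j * base:(j + 1) * base]
                f = backbone(crop)                    # (B, L, C) patch tokens
                hw = int(f.shape[1] ** 0.5)
                cols.append(f.transpose(1, 2).reshape(-1, f.shape[2], hw, hw))
            rows.append(torch.cat(cols, dim=3))       # stitch crops along width
        fmap = torch.cat(rows, dim=2)                 # stitch crops along height
        fmap = F.adaptive_avg_pool2d(fmap, fmap.shape[-1] // n)  # back to base grid
        all_feats.append(fmap)
    return torch.cat(all_feats, dim=1)                # channels = num_scales * dim
```

Flattening the result back to `(batch, hw*hw, num_scales*dim)` gives the multi-scale visual tokens: the token count matches the single-scale case, and only the channel dimension grows.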
What does this PR contain?
There are two changes in this PR.
1. `CLIPVisionTowerS2` is defined in `llava/model/multimodal_encoder/clip_encoder.py`, which augments a CLIP model with S2. This class is the same as the original `CLIPVisionTower` class, except that it returns a multi-scale feature map instead of a single-scale feature map when `forward` is called (a sketch of this wiring follows the list below). The multi-scale feature map has the same shape as the single-scale one, except on the channel dimension, where multi-scale features have `num_scales * original_dim` dimensions (as defined in `self.hidden_size`).
2. `llava/model/multimodal_encoder/builder.py` is modified so that it builds `CLIPVisionTowerS2` instead of `CLIPVisionTower` when S2 is used.
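For the first change, a hypothetical sketch of how the multi-scale forward and the enlarged `hidden_size` can be wired in. The PR's actual `forward` is not quoted in this thread; the `img_sizes`/`max_split_size` keyword names for `s2wrapper`'s `forward`, and the `forward_feature`/`feature_select` helpers, are assumptions modeled on the surrounding code.

```python
class CLIPVisionTowerS2(CLIPVisionTower):
    # ... __init__ as shown earlier in the thread ...

    def forward_feature(self, images):
        # single-scale CLIP features for a batch of split_size x split_size crops;
        # feature_select is assumed to come from the CLIPVisionTower base class
        outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype),
                                 output_hidden_states=True)
        return self.feature_select(outs).to(images.dtype)

    def forward(self, images):
        # s2wrapper resizes the input to each scale in s2_scales, splits it into
        # crops no larger than max_split_size, runs forward_feature on the crops,
        # and concatenates the per-scale features along the channel dimension
        return self.multiscale_forward(self.forward_feature, images,
                                       img_sizes=self.s2_scales,
                                       max_split_size=self.s2_split_size)

    @property
    def hidden_size(self):
        # token count is unchanged; only the channel dimension grows
        return self.config.hidden_size * len(self.s2_scales)
```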
How to train LLaVA with S2?
First install `s2wrapper` through pip: `pip install git+https://github.com/bfshi/scaling_on_scales.git`. This package has only one dependency, `einops`, so installing it shouldn't interfere with your environment.
Training configurations should be the same as training a regular LLaVA without anyres (i.e., `image_aspect_ratio="pad"` and `mm_patch_merge_type="flat"`), except for two new model configs:
- `s2=True`. This turns on the usage of S2.
- `s2_scales="336,672,1008"`. This specifies the image scales S2 will extract features on.
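These two configs are what the builder change described above keys on. A hypothetical sketch of that dispatch follows; the PR's exact diff to `llava/model/multimodal_encoder/builder.py` is not quoted here, so the function name and the provider check are assumptions modeled on LLaVA's existing builder.

```python
# Hypothetical sketch of the builder dispatch: construct CLIPVisionTowerS2 when the
# s2 config is set, and fall back to the regular CLIPVisionTower otherwise.
from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower, CLIPVisionTowerS2


def build_vision_tower(vision_tower_cfg, **kwargs):
    vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower',
                           getattr(vision_tower_cfg, 'vision_tower', None))
    use_s2 = getattr(vision_tower_cfg, 's2', False)
    if vision_tower.startswith("openai") or vision_tower.startswith("laion"):
        if use_s2:
            return CLIPVisionTowerS2(vision_tower, args=vision_tower_cfg, **kwargs)
        return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
    raise ValueError(f'Unknown vision tower: {vision_tower}')
```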