The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Summary
The authors analyze a large multimodal “CLIP” model which consists of two sides: first, a ResNet that processes images and second, a Transformer that processes text. The model is trained to align the visual and semantic streams of information using a contrastive loss. The core finding of this work is that CLIP models have neurons that respond to concepts in a multimodal manner. In the same way that human neurons can respond to the “Halle Berry” concept across different modes (text, image, drawing), the CLIP model has neurons that respond to the “Spiderman” concept across different modes.
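For readers unfamiliar with this training setup, a minimal sketch of the symmetric contrastive objective may help. This is an illustrative reconstruction, not the authors' implementation; the function names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of (image, text) pairs.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matching pairs share a row index; all other rows act as negatives.
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) matrix of pairwise image-text similarities.
    logits = image_emb @ text_emb.t() / temperature

    # The correct "class" for image i is text i, and vice versa.
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```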
The authors investigate a range of other multimodal concepts including people, emotions, and regions. Then they make some general remarks on the properties of these features (including vector arithmetic, bias, and polysemanticity). They look at how the high-level abstraction of these features makes them useful for downstream tasks, such as ImageNet classification, including a neuron-derived taxonomy of animals in Figure 8. They investigate compositional behavior of neurons in the context of emotions. Finally, they conduct typographic attacks on the model by placing text atop images in order to change the model's response to the image. This effect is analogous to the Stroop effect, and they even observe the Stroop effect in a subsequent experiment.
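The simplicity of the typographic attack is worth underscoring. A sketch of the procedure (the overlay position, font, and color are assumptions, not the authors' exact setup):

```python
from PIL import Image, ImageDraw, ImageFont

def typographic_attack(image_path, label, out_path="attacked.png"):
    """Paste a misleading text label onto an image; classifying the
    result zero-shot then tests whether the text overrides the visual
    content (e.g., a piece of fruit labeled "iPod")."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    # Place the label in the lower-left quadrant of the image.
    w, h = img.size
    draw.text((w // 8, 3 * h // 4), label, fill="white", font=font)
    img.save(out_path)
    return img
```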
Strengths and weaknesses
Strengths. The results are significant. The writing is clear and focused. The figures generally support the claims in the text quite well. The links to Microscope aid in deeper understanding (the combination of feature visualization + dataset samples is compelling; the two approaches are rather complementary). The connections to neuroscience are very strong; it will be interesting to see how this changes the way neuroscientists think about multimodal perception and representation.
Political and cultural considerations. This work has the potential to cause a strong political reaction, given the way these visualizations make clear the model’s politically-charged biases. The authors appear to have taken proper precautions to clarify their objectivity on these issues. One negative outcome, in the short term, may be from some people making incorrect and divisive claims about the nature and objectives of this paper. One positive outcome, in the short and long term, may be from other people taking an increased interest in finding responsible means of mitigating these biases. In spite of potential negative short-term reactions, the release of this paper seems to constitute a net good.
Weaknesses. The largest issue with this work from a scientific perspective is that the results are potentially difficult for other scientists to reproduce. Neither the model nor the dataset is open-source. However, significant large-scale scientific results (e.g., sequencing the human genome, or detecting gravitational waves) are inherently difficult to replicate at the time of release. Even so, I would encourage the authors to accompany this paper with more technical details of the model and dataset, as most readers will want to look into these things.
Comments, questions
It is unfortunate that the Hitler neuron responds to German food.
Is there any way to determine the statistical significance of the estimated conditional probabilities of neuron classes (e.g., for Figures 2, 5, and 7)?
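One possible route, sketched here under the assumption that per-image category labels for each neuron's activating examples are available, is a simple bootstrap interval on the estimated conditional probability:

```python
import numpy as np

def bootstrap_ci(labels, category, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI (95% by default) for P(category | neuron fired),
    estimated from the human-assigned labels of activating examples."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Resample the labeled examples with replacement and re-estimate
    # the conditional probability on each resample.
    stats = np.array([
        np.mean(rng.choice(labels, size=labels.size, replace=True) == category)
        for _ in range(n_boot)
    ])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```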
Grammatical error: “with more than 90% of the images with a standard deviation greater than 30 are related to Donald Trump”
Faceted feature visualization results are very compelling. I have not seen “faceted feature visualization” before. Reading the appendix, this appears to be a new approach. Would love to see a longer exposition on this, as the results in Figure 4 are strong.
Figure 6 is a tour de force of visualization. Nit: "city name activations" give large dots on the map in Asia and Europe, but not elsewhere. Why?
Figure 7: "This is the first neuron we've studied closely with a distinct regime change between medium and strong activations." Interesting, I wonder why? The "Flags" category has a large variance, going from -6 to +22 (perhaps this reflects the degree to which each flag is correlated with "Ghana"; some flags will have low correlation whereas others will have high correlation).
Nit: the feature vis for "string instruments" is a face.
“Scorpion” is labeled as “fish” and “seafood” in Figure 8. This is an understandable failure mode, but maybe worth pointing out.
Grammar: “given an text embedding”
Leveraging the bilinear term to obtain feature visualizations that correspond to text is an interesting idea in that it permits exploration of highly abstract concepts in a visual space.
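Concretely, one would optimize an image so that its embedding has a large inner product with a frozen text embedding of shape (1, dim). A bare-bones sketch, omitting the parameterization and robustness transforms that feature visualization methods typically require; all names here are illustrative:

```python
import torch
import torch.nn.functional as F

def visualize_text_concept(image_encoder, text_emb, steps=256, lr=0.05,
                           size=224, device="cuda"):
    """Ascend the bilinear similarity <f(x), t> in pixel space for a
    frozen text embedding t."""
    x = torch.randn(1, 3, size, size, device=device, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    t = F.normalize(text_emb, dim=-1).detach()
    for _ in range(steps):
        opt.zero_grad()
        # Sigmoid keeps pixel values in [0, 1].
        img_emb = F.normalize(image_encoder(torch.sigmoid(x)), dim=-1)
        loss = -(img_emb @ t.t()).sum()  # maximize cosine similarity
        loss.backward()
        opt.step()
    return torch.sigmoid(x).detach()
```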
Why are the feature visualizations of concepts like "lightning," "art," "painting," and "miracle" faces in Figure 10? Could faceted feature visualization have been used to improve these visualizations? The diagram still makes intuitive sense and does not need to change.
Adversarial attacks section is clear and experiments make sense. I particularly enjoy the connection to the Stroop effect.
No conclusion section?
Distill employs a reviewer worksheet as a help for reviewers. The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score; we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results
Advancing the Dialogue
How significant are these contributions? 5/5

Outstanding Communication
Article Structure: 4/5
Writing Style: 5/5
Diagram & Interface Style: 5/5
Impact of diagrams / interfaces / tools for thought? 4/5
Readability: 4/5

Scientific Correctness & Integrity
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 5/5
How easy would it be to replicate (or falsify) the results? 3/5
Does the article cite relevant work? 5/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5