index.json
[{"authors":["admin"],"categories":null,"content":"Idan\u0026rsquo;s research revolves around cognition in deep learning models, encompassing multimodal problems, attention, model perceptiveness, and model comprehension.\nIdan is a postdoctoral researcher in the Computer Science department at Tel-Aviv University, where he collaborates with Prof. Lior Wolf. Before this, he earned his PhD in Computer Science from the Technion, under the supervision of Prof. Alexander G. Schwing and Prof. Tamir Hazan. In 2016, he joined eBay as a researcher, applying computer vision and natural language processing solutions to enhance eBay’s catalog. He then moved to Microsoft in 2019, becoming a researcher within the Search, Assistant, and Intelligence group. By 2020, he had assumed the role of Head of Research at Spot, a company later acquired by NetApp.\n","date":-62135596800,"expirydate":-62135596800,"kind":"term","lang":"en","lastmod":1692779193,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://idansc.github.io/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"Idan\u0026rsquo;s research revolves around cognition in deep learning models, encompassing multimodal problems, attention, model perceptiveness, and model comprehension.\nIdan is a postdoctoral researcher in the Computer Science department at Tel-Aviv University, where he collaborates with Prof. Lior Wolf. Before this, he earned his PhD in Computer Science from the Technion, under the supervision of Prof. Alexander G. Schwing and Prof. Tamir Hazan. In 2016, he joined eBay as a researcher, applying computer vision and natural language processing solutions to enhance eBay’s catalog.","tags":null,"title":"Idan Schwartz","type":"authors"},{"authors":["G. Yariv","I. Gat","S. Benaim","L. Wolf","I. Schwartz","Y. Adi"],"categories":null,"content":"","date":1700438400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1700438400,"objectID":"5f0dd46ce95967be5618d15279167a52","permalink":"https://idansc.github.io/publication/yariv2023diverse/","publishdate":"2023-11-20T12:32:10.221016Z","relpermalink":"/publication/yariv2023diverse/","section":"publication","summary":"We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. 
In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and the temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.","tags":null,"title":"Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation","type":"publication"},
{"authors":["Y. Tewel","Y. Shalev","R. Nadler","I. Schwartz","L. Wolf"],"categories":null,"content":"","date":1692748800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1692748800,"objectID":"fb057bcb1d9f4dd10083a9f52f546896","permalink":"https://idansc.github.io/publication/tewel2022zero/","publishdate":"2023-08-20T12:32:10.221016Z","relpermalink":"/publication/tewel2022zero/","section":"publication","summary":"We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model. The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames. Unlike zero-shot image captioning methods, our work considers the entire sentence at once. This is achieved by optimizing, during the generation process, part of the prompt from scratch, by modifying the representation of all other tokens in the prompt, and by repeating the process iteratively, gradually improving the specificity and comprehensiveness of the generated sentence. Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.","tags":null,"title":"Zero-shot video captioning with evolving pseudo-tokens","type":"publication"},
{"authors":["G. Yariv","I. Gat","L. Wolf","Y. Adi","I. Schwartz"],"categories":null,"content":"","date":1681948800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1681948800,"objectID":"2483c5aaef1383a597a2fa2ad5fff396","permalink":"https://idansc.github.io/publication/yariv2023audiotoken/","publishdate":"2023-04-20T12:32:10.221016Z","relpermalink":"/publication/yariv2023audiotoken/","section":"publication","summary":"In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This raises the question: how can we adapt such models to be conditioned on other modalities? In this paper, we propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods, considering objective and subjective metrics.","tags":null,"title":"AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation","type":"publication"},
{"authors":["I. Schwartz","V. Snæbjarnarson","S. Benaim","H. Chefer","R. Cotterell","L. Wolf","S. Belongie"],"categories":null,"content":"","date":1681948800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1681948800,"objectID":"3f12ca6350b26b1e179a3f2a75dede53","permalink":"https://idansc.github.io/publication/schwartz2023discriminative/","publishdate":"2023-04-20T12:32:10.221016Z","relpermalink":"/publication/schwartz2023discriminative/","section":"publication","summary":"Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. However, generated images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. One way of alleviating these issues is to train diffusion models on class-labeled datasets. This comes with a downside, as doing so limits their expressive power: (i) supervised datasets are generally small compared to large-scale scraped text-image datasets on which text-to-image models are trained, and so the quality and diversity of generated images are severely affected, or (ii) the input is a hard-coded label, as opposed to free-form text, which limits the control over the generated images. In this work, we propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier, which guides the generation. This is done by iteratively modifying the embedding of a single input token of a text-to-image diffusion model, using the classifier, by steering generated images toward a given target class. Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images or retraining of a noise-tolerant classifier. We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.","tags":null,"title":"Discriminative Class Tokens for Text-to-Image Diffusion Models","type":"publication"},
{"authors":["O. Hupert","I. Schwartz","L. Wolf"],"categories":null,"content":"","date":1666224000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1666224000,"objectID":"d7941f796262e511ec63e8c28fd90d50","permalink":"https://idansc.github.io/publication/hupert2022describing/","publishdate":"2022-10-20T12:32:10.221016Z","relpermalink":"/publication/hupert2022describing/","section":"publication","summary":"We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set. Our procedure is analogous to Principal Component Analysis, in which the role of projection vectors is replaced with generated phrases. First, a centroid phrase that has the largest average semantic similarity to the images in the set is generated, where both the computation of the similarity and the generation are based on pretrained vision-language models. Then, the phrase that yields the highest variation among the similarity scores is generated, using the same models. The next phrase maximizes the variance subject to being orthogonal, in the latent space, to the highest-variance phrase, and the process continues.
Our experiments show that our method is able to convincingly capture the essence of image sets and describe the individual elements in a semantically meaningful way within the context of the entire set.","tags":null,"title":"Describing Sets of Images with Textual-PCA","type":"publication"},
{"authors":["H. Chefer","I. Schwartz","L. Wolf"],"categories":null,"content":"","date":1658275200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1658275200,"objectID":"12911871d6ab3a24856cfea49ff20a84","permalink":"https://idansc.github.io/publication/chefer2022optimizing/","publishdate":"2022-07-20T12:32:10.221016Z","relpermalink":"/publication/chefer2022optimizing/","section":"publication","summary":"It has been observed that visual classification models often rely mostly on spurious cues such as the image background, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and direct the model to base its prediction on the foreground object. This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks. Specifically, we encourage the model's relevancy map (i) to assign lower relevance to background regions, (ii) to consider as much information as possible from the foreground, and (iii) to produce decisions with high confidence. When applied to Vision Transformer (ViT) models, a marked improvement in robustness to domain-shifts is observed. Moreover, the foreground masks can be obtained automatically, from a self-supervised variant of the ViT model itself; therefore no additional supervision is required.","tags":null,"title":"Optimizing Relevance Maps of Vision Transformers Improves Robustness","type":"publication"},
{"authors":["Y. Tewel","Y. Shalev","I. Schwartz","L. Wolf"],"categories":null,"content":"","date":1655683200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1655683200,"objectID":"b69e47712c5b23ca85f26b047eb84106","permalink":"https://idansc.github.io/publication/tewel2022zerocap/","publishdate":"2022-06-20T12:32:10.221016Z","relpermalink":"/publication/tewel2022zerocap/","section":"publication","summary":"Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.","tags":null,"title":"ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic","type":"publication"},
{"authors":["A. Ali","I. Schwartz","T. Hazan","L. Wolf"],"categories":null,"content":"","date":1653004800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1653004800,"objectID":"180c9cd73de9b8ca4217ad504a5533a5","permalink":"https://idansc.github.io/publication/ali2022video/","publishdate":"2022-05-20T12:32:10.221016Z","relpermalink":"/publication/ali2022video/","section":"publication","summary":"We present a method for matching a text sentence from a given corpus to a given video clip and vice versa. Traditionally, video and text matching is done by learning a shared embedding space, where the encoding of one modality is independent of the other. In this work, we encode the dataset in a way that takes into account the query's relevant information. The power of the method is demonstrated to arise from pooling the interaction data between words and frames. Since the encoding of the video clip depends on the sentence compared to it, the representation needs to be recomputed for each potential match. To this end, we propose an efficient shallow neural network. Its training employs a hierarchical triplet loss that is extendable to paragraph/video matching. The method is simple, provides explainability, and achieves state-of-the-art results for both sentence-clip and video-text matching by a sizable margin across five different datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC. We also show that our conditioned representation can be transferred to video-guided machine translation, where we improve the current results on VATEX.","tags":null,"title":"Video and Text Matching with Conditioned Embeddings","type":"publication"},
{"authors":["T. Braude","I. Schwartz","A. G. Schwing","A. Shamir"],"categories":null,"content":"","date":1645315200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1645315200,"objectID":"e0338a05472babcae0fda1b1832a1688","permalink":"https://idansc.github.io/publication/braude2022ordered/","publishdate":"2022-02-20T12:32:10.221016Z","relpermalink":"/publication/braude2022ordered/","section":"publication","summary":"We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each story sentence should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images. Current approaches encode images independently, disregarding relations between images. Our approach learns to encode images with different interactions based on the story position (i.e., past image or future image). To this end, we develop a novel message-passing-like algorithm for ordered image attention (OIA) that collects interactions across all the images in the sequence. Finally, to generate the story's sentences, a second attention mechanism picks the important image attention vectors with an Image-Sentence Attention (ISA). The obtained results improve the METEOR score on the VIST dataset by 1%. Furthermore, a thorough human study confirms improvements and demonstrates that order-based interactions significantly improve coherency (64.20% vs. 28.70%).","tags":null,"title":"Ordered attention for coherent visual storytelling","type":"publication"},
{"authors":["I. Gat","G. Lorberbom","I. Schwartz","T. Hazan"],"categories":null,"content":"","date":1641081600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1641081600,"objectID":"06dfc69e23335dcba4f44ffdb22517dc","permalink":"https://idansc.github.io/publication/gat2022latent/","publishdate":"2022-01-02T12:32:10.221016Z","relpermalink":"/publication/gat2022latent/","section":"publication","summary":"The success of deep neural nets heavily relies on their ability to encode complex relations between their input and their output. While this property serves to fit the training data well, it also obscures the mechanism that drives prediction. This study aims to reveal hidden concepts by employing an intervention mechanism that shifts the predicted class based on discrete variational autoencoders. An explanatory model then visualizes the encoded information from any hidden layer and its corresponding intervened representation. By assessing the differences between the original representation and the intervened representation, one can determine the concepts that can alter the class, hence providing interpretability. We demonstrate the effectiveness of our approach on CelebA, where we show various visualizations for bias in the data and suggest different interventions to reveal and change bias.","tags":null,"title":"Latent space explanation by intervention","type":"publication"},
{"authors":["I. Gat","I. Schwartz","A. G. Schwing"],"categories":null,"content":"","date":1630454400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691577070,"objectID":"faa27280a09e6ca47854593882dc3b6d","permalink":"https://idansc.github.io/publication/gat-neurips-2021/","publishdate":"2021-09-10T12:32:10.221016Z","relpermalink":"/publication/gat-neurips-2021/","section":"publication","summary":"Machine learning advances in the last decade have relied significantly on large-scale datasets that continue to grow in size. Increasingly, those datasets also contain different data modalities. However, large multi-modal datasets are hard to annotate, and annotations may contain biases that we are often unaware of. Deep-net-based classifiers, in turn, are prone to exploit those biases and to find shortcuts. To study and quantify this concern, we introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features, i.e., modalities. Using the perceptual score, we find a surprisingly consistent trend across four popular datasets: recent, more accurate state-of-the-art multi-modal models for visual question-answering or visual dialog tend to perceive the visual data less than their predecessors. This is concerning, as answers are increasingly inferred from textual cues alone. Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions. We hope to spur a discussion on the perceptiveness of multi-modal models and also hope to encourage the community working on multi-modal classifiers to start quantifying perceptiveness via the proposed perceptual score.","tags":null,"title":"Perceptual Score: Measuring Perceptiveness of Multi-Modal Classifiers","type":"publication"},
{"authors":["I. Schwartz"],"categories":null,"content":"","date":1622505600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691577070,"objectID":"cddc86c1cf3c792d9686cd902f6870fd","permalink":"https://idansc.github.io/publication/schwartz-naacl-2021/","publishdate":"2021-06-10T12:32:10.221016Z","relpermalink":"/publication/schwartz-naacl-2021/","section":"publication","summary":"Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as BLEU scores, favor correct syntax over semantics. Hence, a discriminative approach is often used, where an agent ranks a set of candidate options. The mean reciprocal rank (MRR) metric evaluates the model performance by taking into account the rank of a single human-derived answer. This approach, however, raises a new challenge: the ambiguity and synonymy of answers, for instance, semantic equivalence (e.g., 'yeah' and 'yes'). To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all the correct answers via dense annotations. However, the NDCG metric favors the usually applicable uncertain answers such as 'I don't know'. Crafting a model that excels on both MRR and NDCG metrics is challenging. Ideally, an AI agent should give a human-like reply and validate the correctness of any answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we manage to retain most of the MRR state-of-the-art performance (70.41% vs. 71.24%) and the NDCG state-of-the-art performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge.","tags":null,"title":"Ensemble of MRR and NDCG models for Visual Dialog","type":"publication"},
{"authors":["I. Gat","I. Schwartz","A. G. Schwing","T. Hazan"],"categories":null,"content":"","date":1601510400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691577070,"objectID":"20fd2862585e41ca548080f5d25df2e7","permalink":"https://idansc.github.io/publication/gat-neurips-2020/","publishdate":"2020-10-10T12:32:10.221016Z","relpermalink":"/publication/gat-neurips-2020/","section":"publication","summary":"Many recent datasets contain a variety of different data modalities, for instance, image, question, and answer data in visual question answering (VQA). When training deep net classifiers on those multi-modal datasets, the modalities get exploited at different scales, i.e., some modalities can more easily contribute to the classification results than others. This is suboptimal because the classifier is inherently biased towards a subset of the modalities. To alleviate this shortcoming, we propose a novel regularization term based on the functional entropy. Intuitively, this term encourages balancing the contribution of each modality to the classification result. However, regularization with the functional entropy is challenging. To address this, we develop a method based on the log-Sobolev inequality, which bounds the functional entropy with the functional Fisher information. Intuitively, this maximizes the amount of information that the modalities contribute. On the two challenging multi-modal datasets VQA-CPv2 and SocialIQ, we obtain state-of-the-art results while more uniformly exploiting the modalities.
In addition, we demonstrate the efficacy of our method on Colored MNIST.","tags":null,"title":"Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies","type":"publication"},
{"authors":["I. Schwartz","A. G. Schwing","T. Hazan"],"categories":null,"content":"","date":1560297600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691577070,"objectID":"55b78618193469566ace21183ab34b41","permalink":"https://idansc.github.io/publication/schwartz-cvpr-2019/","publishdate":"2019-05-31T12:32:10.221016Z","relpermalink":"/publication/schwartz-cvpr-2019/","section":"publication","summary":"The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates useful signals from distracting ones in a data-driven manner, using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that allow it to outperform the current state-of-the-art by more than 20% on CIDEr.","tags":null,"title":"A Simple Baseline for Audio-Visual Scene-Aware Dialog","type":"publication"},
{"authors":["I. Schwartz","A. G. Schwing","T. Hazan"],"categories":null,"content":"","date":1560297600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691577070,"objectID":"6a3d011c4fef024bda78ce922f97099b","permalink":"https://idansc.github.io/publication/schwartz-fgacvpr-2019/","publishdate":"2019-05-31T12:35:14.854157Z","relpermalink":"/publication/schwartz-fgacvpr-2019/","section":"publication","summary":"Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results in extracting details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% for VisDial0.9 and by 2% for VisDial1.0 on MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6%.","tags":null,"title":"Factor Graph Attention","type":"publication"},
{"authors":["I. Schwartz","A. G. Schwing","T. Hazan"],"categories":null,"content":"","date":1513209600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691577070,"objectID":"b716db9682e082edae946aae30e3ef50","permalink":"https://idansc.github.io/publication/schwartz-nips-2017/","publishdate":"2019-05-31T12:35:14.854941Z","relpermalink":"/publication/schwartz-nips-2017/","section":"publication","summary":"The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper, we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.","tags":null,"title":"High-Order Attention Models for Visual Question Answering","type":"publication"}]