
The text-to-video retrieval task requires ranking all the videos in a database by how semantically close they are to an input query. To do so, both the visual and the textual content need to be carefully analyzed and understood, which calls for a wide range of Computer Vision and Natural Language Processing techniques. Despite its intrinsic difficulty, the problem is a fundamental one: nowadays, hundreds of hours of video content are uploaded to the Internet every minute, so effective solutions are essential to search this material and retrieve the videos the user is looking for. Moreover, given the need for multi-modal content understanding, advancements in this field may lead to improvements in many other problems, including Captioning and Question Answering.
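A common way to realize this ranking is to embed the query and the videos in a shared space and sort the videos by similarity to the query. The sketch below illustrates the idea with cosine similarity over pre-computed embeddings; the function name and the toy vectors are illustrative, not taken from the papers listed here.

```python
import numpy as np

def rank_videos(text_emb, video_embs):
    """Rank videos by cosine similarity to a text query embedding.

    text_emb: (d,) query embedding; video_embs: (n, d) video embeddings.
    Returns video indices sorted from most to least similar.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ t                 # cosine similarity of each video to the query
    return np.argsort(-sims)     # descending order of similarity

# Toy example with hypothetical 3-d embeddings
query = np.array([1.0, 0.0, 0.0])
videos = np.array([[0.0, 1.0, 0.0],   # unrelated video
                   [0.9, 0.1, 0.0],   # close match
                   [0.5, 0.5, 0.0]])  # partial match
print(rank_videos(query, videos))     # close match ranked first
```

In practice, the embeddings would come from trained video and text encoders, and the training objective (e.g., a contrastive loss) shapes the shared space so that matching pairs end up close together.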

Related publications:
- Alex Falcon, Giuseppe Serra, Oswald Lanz: “Learning Video Retrieval Models with Relevance-Aware Online Mining”, International Conference on Image Analysis and Processing (ICIAP ’21), 2022
- Alex Falcon, Swathikiran Sudhakaran, Giuseppe Serra, Sergio Escalera, Oswald Lanz: “Relevance-based Margin for Contrastively-trained Video Retrieval Models”, International Conference on Multimedia Retrieval (ICMR ’22), 2022
- Alex Falcon, Giuseppe Serra, Oswald Lanz: “A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval”, ACM International Conference on Multimedia (ACM MM ’22), 2022
Research group:
- Alex Falcon (AILAB-Udine Member)
- Oswald Lanz (TeV, FBK, AILAB-Udine External Collaborator)
- Giuseppe Serra (AILAB-Udine Member)