The text-to-video retrieval task requires ranking all the videos in a database by how semantically close they are to an input query. To do so, both the visual and the textual content must be carefully analyzed and understood, which calls for a wide range of Computer Vision and Natural Language Processing techniques. Despite its intrinsic difficulty, the problem is a fundamental one: several hundred hours of video content are uploaded to the Internet every minute, so effective solutions are needed to search this content and retrieve all the videos the user is looking for. Moreover, given the need for multi-modal content understanding, advancements in this field may lead to improvements in many other problems, including Captioning and Question Answering.

EPIC-Kitchens-100 Multi-Instance Retrieval Challenge tech reports:

Alex Falcon, Giuseppe Serra. UniUD Submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2023. We ranked 3rd using only 25% of […]