
Video Question Answering (VideoQA) is a task that requires a model to analyze and jointly reason over a video and a question about its visual content in order to produce a meaningful and coherent answer. Solving this task would bring a model closer to human-level capability in handling both complex video data and text about visual content, since it requires learning to isolate and pinpoint objects of interest in the video and to identify and reason about their interactions in both the spatial and temporal domains, while establishing the essential links to the given question. Thus, VideoQA represents a challenging task at the interface between Computer Vision and Natural Language Processing (NLP). Modern approaches to this task draw on a wide range of techniques, such as: temporal attention and spatial attention, in order to learn which frames and which regions in each frame are most important for solving the task; given the multimodal nature of the data, cross-modality fusion mechanisms, question-answer-aware […]