Video Question Answering

Video Question Answering (VideoQA) is a task that requires analyzing and jointly reasoning over a video and a question about its visual content in order to produce a meaningful and coherent answer. Solving it would bring a model closer to human-level capability in handling both complex video data and the text that refers to its visual content: the model must learn to isolate and pinpoint objects of interest in the video, identify and reason about their interactions in both the spatial and temporal domains, and bind them to the given question. VideoQA is therefore a challenging task at the interface between Computer Vision and Natural Language Processing (NLP).

Modern approaches to this task draw on a wide range of techniques. Temporal and spatial attention are used to learn which frames, and which regions within each frame, matter most for answering the question. Given the multimodal nature of the data, the literature has also proposed cross-modality fusion mechanisms, question-answer-aware representations of both the visual and the textual data, memory networks based on Neural Turing Machines (NTM), and graph-based reasoning techniques.
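To make the idea of temporal attention more concrete, here is a minimal sketch in plain Python. It is not the method of any specific paper: it assumes that each frame and the question have already been encoded as vectors of the same dimension, and it uses scaled dot-product similarity as one common (but not the only) scoring choice. The names temporal_attention, frame_features, and question_vector are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, question_vector):
    """Weight each frame by its similarity to the question.

    frame_features: list of T frame vectors of dimension d
        (assumed to be precomputed, e.g. by a visual encoder).
    question_vector: one vector of dimension d
        (assumed to be a precomputed question encoding).
    Returns (weights, attended): weights sum to 1 over the T frames,
    attended is the weighted average of the frame vectors.
    """
    d = len(question_vector)
    # Scaled dot-product score between every frame and the question.
    scores = [
        sum(f_k * q_k for f_k, q_k in zip(frame, question_vector)) / math.sqrt(d)
        for frame in frame_features
    ]
    weights = softmax(scores)
    # Question-conditioned summary of the video.
    attended = [
        sum(w * frame[k] for w, frame in zip(weights, frame_features))
        for k in range(d)
    ]
    return weights, attended


# Toy input: three frames with 2-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [0.0, 2.0]
weights, attended = temporal_attention(frames, question)
```

In this toy input the second frame is the one most aligned with the question vector, so it receives the largest weight. Spatial attention follows the same pattern, with the scores computed over regions of a single frame instead of over frames.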

Video and Image Question Answering (VIQA) workshop

Together with Technologies of Vision (Fondazione Bruno Kessler), cv:hci (Karlsruhe Institute of Technology), and Inria Paris, we are organizing the Video and Image Question Answering (VIQA) workshop at the 25th International Conference on Pattern Recognition (ICPR 2020), which will be held in Milan, Italy (January 10-15, 2021). The Call for Papers is out now! Have a look at the flyer here!

Research group

  • Alex Falcon (AILAB-Udine member)
  • Oswald Lanz (TeV – FBK, AILAB-Udine External Collaborator)
  • Giuseppe Serra (AILAB-Udine member)