Arabic Keyphrase Extraction

Arabic keyphrase extraction is a crucial task due to the significant and growing amount of Arabic text on the web generated by a huge population. It is becoming a challenge for the community of Arabic natural language processing because of the severe shortage of resources and published processing systems. In this paper we propose a deep learning based approach for Arabic keyphrase extraction that achieves better performance compared to the related competitive approaches. We also introduce the community with an annotated large-scale dataset of about 6000 scientific abstracts which can be used for training, validating and evaluating deep learning approaches for Arabic keyphrase extraction. Related publications: Helmy M., Vigneshram R. M., Serra G., Tasso C. Applying Deep Learning for Arabic Keyphrase Extraction. In: Proc. of the 4th International Conference on Arabic Computational Linguistics (ACLing 2018), November 17-19 2018, Dubai, United Arab Emirates. Resources: Arabic Abstracts Dataset


Egocentric Vision for Detecting Social Relationships

Social interactions are so natural that we rarely stop wondering who is interacting with whom or which people are gathering into a group and who are not. Nevertheless, humans naturally do that neglecting that the complexity of this task increases when only visual cues are available. Different situations need different behaviors: while we accept to stand in close proximity to strangers when we at- tend some kind of public event, we would feel uncomfortable in having people we do not know close to us when we have a coffee. In fact, we rarely exchange mutual gaze with people we are not interacting with, an important clue when trying to discern different social clusters. We address the problem of partitioning people in a video sequence into socially related groups from an egocentric vision (from now on, ego-vision) perspective. Human behavior is by no means random: when interacting with each other we generally stand in determined positions to avoid occlusions in our […]


Automatic Keyphrase Extraction

Keyphrases (KPs) are phrases that “capture the main topic discussed on a given document”. More specifically, KPs are phrases typically one to five words long that appear verbatim in a document, and can be used to briefly summarize its content. The task of finding such KPs is called Automatic Keyphrase Extraction (AKE). Recently, AKE has received a lot of attention, because it has been successfully used in many natural language processing (NLP) tasks, such as text summarization, document clustering, or non-NLP tasks such as social network analysis or user modeling. AKE  approaches have been also applied in Information Retrieval of relevant documents in digital document archives which can contain heterogeneous types of items, such as books articles, papers etc. However, given the wide variety of lexical, linguistic and semantic aspects that can contribute to define a keyphrase, it difficult to design hand-crafted feature, and even the best performing algorithms hardly reach F1-Scores of 50% on the most common evaluation sets. […]


Predicting the Usefulness of Amazon Reviews Using Argumentation Mining

Argumentation is the discipline that studies the way in which humans debate and articulate their opinions and beliefs. Argumentation mining is a research area at the cross-road of many fields, such as computational linguistics, machine learning, artificial intelligence, natural-language processing. The main goal of argumentation mining is the automatically extraction and identification of arguments and their relations from natural language text documents. Internet users generate content at unprecedented rates. Building intelligent systems capable of discriminating useful content within this ocean of information is thus becoming a urgent need. In this paper, we aim to predict the usefulness of Amazon reviews, and to do this we exploit features coming from an off-the-shelf argumentation mining system. We argue that the usefulness of a review, in fact, is strictly related to its argumentative content, whereas the use of an already trained system avoids the costly need of relabeling a novel dataset. Results obtained on a large publicly available corpus support this hypothesis. Related […]


Local Pyramidal Descriptors for Image Recognition

In this paper we present a novel method to improve the flexibility of descriptor matching for image recognition by using local multiresolution pyramids in feature space. We propose that image patches be represented at multiple levels of descriptor detail and that these levels be defined in terms of local spatial pooling resolution. Preserving multiple levels of detail in local descriptors is a way of hedging one’s bets on which levels will most relevant for matching during learning and recognition. We introduce the Pyramid SIFT (P-SIFT) descriptor and show that its use in four state-of-the-art image recognition pipelines improves accuracy and yields state-of-the-art results. Our technique is applicable independently of spatial pyramid matching and we show that spatial pyramids can be combined with local pyramids to obtain further improvement. We achieve state-of-the-art results on Caltech-101 (80.1%) and Caltech-256 (52.6%) when compared to other approaches based on SIFT features over intensity images. Our technique is efficient and is extremely easy to integrate into […]


Human action categorization in unconstrained videos

Building a general human activity recognition and classification system is a challenging problem, because of the variations in environment, people and actions. In fact environment variation can be caused by cluttered or moving background, camera motion, illumination changes. People may have different size, shape and posture appearance. Recently, interest-points based models have been successfully applied to the human action classification problem, because they overcome some limitations of holistic models such as the necessity of performing background subtraction and tracking. We are working at a novel method based on the visual bag-of-words model and on a new spatio-temporal descriptor. First, we define a new 3D gradient descriptor that combined with optic flow outperforms the state-of-the-art, without requiring fine parameter tuning. Second, we show that for spatio-temporal features the popular k-means algorithm is insufficient because cluster centers are attracted by the denser regions of the sample distribution, providing a non-uniform description of the feature space and thus failing to code other informative […]


GOLD: Gaussians of Local Descriptors for image representation

The Bag of Words paradigm has been the baseline from which several successful image classification solutions were developed in the last decade. These represent images by quantizing local descriptors and summarizing their distribution. The quantization step introduces a dependency on the dataset, that even if in some contexts significantly boosts the performance, severely limits its generalization capabilities. Differently, in this paper, we propose to model the local features distribution with a multivariate Gaussian, without any quantization. The full rank covariance matrix, which lies on a Riemannian manifold, is projected on the tangent Euclidean space and concatenated to the mean vector. The resulting representation, a Gaussian of Local Descriptors (GOLD), allows to use the dot product to closely approximate a distance between distributions without the need for expensive kernel computations. We describe an image by an improved spatial pyramid, which avoids boundary effects with soft assignment: local descriptors contribute to neighboring Gaussians, forming a weighted spatial pyramid of GOLD descriptors. In […]


Video event classification using bag-of-words and string kernels

The recognition of events in videos is a relevant and challenging task of automatic semantic video analysis. At present one of the most successful frameworks, used for object recognition tasks, is the bag-of-words (BoW) approach. However it does not model the temporal information of the video stream. We are working at a novel method  to introduce temporal information within the BoW approach by modeling a video clip as a sequence of histograms of visual features, computed from each frame using the traditional BoW model. The sequences are treated as strings where each histogram is considered as a character. Event classification of these sequences of variable size, depending on the length of the video clip, are performed using SVM classifiers with a string kernel (e.g using the Needlemann-Wunsch edit distance). Experimental results, performed on two domains, soccer video and TRECVID 2005, demonstrate the validity of the proposed approach. Related Publication: Ballan, L., M. Bertini, A. Del Bimbo, and G. Serra, “Video Event […]


A SIFT-based forensic method for copy-move detection

In many application scenarios digital images play a basic role and often it is important to assess if their content is realistic or has been manipulated to mislead watcher’s opinion. Image forensics tools provide answers to similar questions. We are working on a novel method that focuses in particular on the problem of detecting if a feigned image has been created by cloning an area of the image onto another zone to make a duplication or to cancel something awkward. The proposed approach is based on SIFT features and allows both to understand if a copy-move attack has occurred and which are the image points involved, and, furthermore, to recover which has been the geometric transformation happened to perform cloning, by computing the transformation parameters. In fact when a copy-move attack takes place, usually an affine transformation is applied to the image patch selected to fit in a specified position according to that context. Our experimental results confirm that the […]


Vidi Video: Interactive semantic video search with a large theasurus of machine-learned audio-visual concepts

Video is vital to society and economy. It plays a key role in the news, cultural heritage documentaries and surveillance, and it will soon be the natural form of communication for the Internet and mobile phones. Digital video will bring more formats and opportunities and it is certain the the consumer and the professional need advanced storage and search technology for the management of large-scale video assets. This project takes on the challenge of creating a substantially enhanced semantic access to video, implemented in a search engine. Vidi Video will boost the performance of video search by forming a 1000 element of thesaurus detecting instances of audio, visual or mixed-media content. The consortium presents excellent expertise and resource: the machine learning with active 1-class classifiers to minimize the need for annotated examples is lead by the University of Surrey, UK. Video stream processing is lead by Centre For Research and Techonolgy Hellas, Greece. Another component is audio event detection, lead by INESC-ID, Portugal. Visual image processing is lead by […]


Metric target tracking

In the context of visual surveillance one of the most important problem is the observation of human activity. This problem is greatly simplified when metric information can be computed. The goal of this project is to development and test new algorithms to determine metric information automatically by observing the scene. A system to find the metric information by tracking a moving person on a ground has been developed. This algorithm consists in a method for calibration of two cameras based on features of a moving person in their common field of view. Only the image of foot and head locations are used. In fact these points and their geometric relationship between cameras give enough information to find their relative position and orientation and the internal parameters of each camera, the focal length and the principal point. In particular the proposed method works under the assumption that the scene needs to be modeled well with a dominant ground plane and the […]