Deep Document Summarization: Combining Text and Images

Most text summarization methods linguistically analyse the text of a document and, on occasion, related contextual information to identify relevant sentences for a summary. However, a document (e.g. an encyclopaedic article, a scientific article, or a news story) is a complex record of knowledge which includes, in addition to text, graphical elements, all of them relevant and often necessary for understanding the contents of the document. Pictorial representations have rarely been used to assess the relevance of information in text summarization, even though visual information has the potential to help automatic systems spot essential textual content that could be used for summary generation. But which pictorial elements in a document are central and which are not? We know very little about this question. Since graphical elements are usually described with text, an interesting and not well understood problem is the identification of the sentences or sentence fragments in the document which accurately describe a given graphical element (scope identification). All these factors are worth investigating, and we propose to do so in a new approach to text summarization which combines Deep Learning in NLP with image processing. Current automatic image classification and labelling methods will certainly be essential in this inquiry.

The top-level objective of this research is to understand whether and how pictorial representations (in a broad sense) found in documents could assist in text summarization. This includes the following high-level goals:

(i) Understand which pictorial elements in a document convey essential information;

(ii) Propose and implement methods to interpret pictorial representations occurring in documents with natural language;

(iii) Propose a method based on current deep learning techniques to enhance existing text summarization methods;

(iv) Evaluate the proposed research in a rationally designed evaluation framework.

Infrastructure to develop the project and available know-how

The research will be carried out at DTIC / Universitat Pompeu Fabra (Barcelona, Spain) under the supervision of Dr. Horacio Saggion, head of the Large Scale Text Understanding Systems Lab, who has recognized experience in natural language processing and, more specifically, in scientific text summarization. The research is in line with our long-term vision in scientific text understanding and our current commitment to the "Maria de Maeztu" excellence program in data-intensive knowledge extraction. The DTIC has the hardware (High Performance Computer Cluster + NVIDIA graphics processing units) and software infrastructure to undertake this research.

Candidate Profile and How to Apply

In the context of the 2020 PhD InPhiNIT La Caixa program associated with the Maria de Maeztu Strategic Research Program, we are looking for a highly motivated PhD candidate in the area of Natural Language Processing to work on a project which will advance the state of the art in text summarization by addressing the summarization of multimodal documents which contain text and images.

The PhD will be carried out at the TALN research group of the Department of Information and Communication Technologies (DTIC), Universitat Pompeu Fabra (UPF) in Barcelona.

The PhD student should have a background in Natural Language Processing, with solid knowledge of statistics, mathematics, computer programming, and machine learning. Knowledge of current methods in Deep Learning would be highly appreciated. Experience in Information Extraction, Text Summarization, or related areas would also be appreciated.

How and where to apply:

If you are interested in this position, you should contact Prof. Horacio Saggion (e-mail, twitter).

Applications are managed by La Caixa; please check here:

Application period: until the beginning of February 2020 (please check!)

The Large Scale Text Understanding Systems Lab:

https://www.upf.edu/web/taln/large-scale-text-understanding-systems

Horacio Saggion Profiles

Google Scholar

DBLP Index

The TALN research group:

http://taln.upf.edu/

Maria de Maeztu Strategic Research at DTIC:

Related information:

Dr Inventor Text Mining Library

SUMMA Summarization Library

PDF Digest

SEPLN Anthology Exploration Browser