How can Natural Language Processing improve access to scientific literature? Tutorial at COLING 2016

Over the last decades, scientific literature has experienced exponential growth: every thirteen seconds a new article is published and added to the already huge set of more than 2.5 million papers currently available online. In this scenario, automated approaches to extract, enrich, aggregate and summarize the content of publications have become essential tools that help researchers and other interested parties deal with scientific information overload. Natural Language Processing and Text Mining play a central role as the key technologies that enable such automated analyses of scientific literature.

The tutorial "Natural Language Processing for Intelligent Access to Scientific Information", presented at the 26th International Conference on Computational Linguistics (COLING 2016), provides an overview of the main approaches to mine a broad variety of structured, semantic information from scientific articles by focusing on the following four aspects:

  • the extraction of structured textual contents from PDF papers: in spite of the growing adoption of XML-based scientific publishing formats, PDF files still account for 80% of the times a scientific publication is accessed online. Moreover, a considerable percentage of scientific literature is available only in PDF format. As a consequence, the ability to effectively extract structured textual contents from PDF articles is a basic but essential step to bootstrap any more complex analysis. The tutorial provides a thorough overview of the most relevant PDF-to-text conversion tools. General-purpose as well as customized tools are reviewed by comparing the distinct approaches (both rule-based and machine-learning-based) adopted to identify and link the structural elements of articles, including information from the header (title, authors, affiliations), the body, and the bibliography;
  • the analysis of scientific discourse: the automated characterization of the argumentative structure that the authors of a publication adopt to present, discuss and motivate their work provides valuable information to better process and search for relevant contents in scientific literature. For instance, if we are able to automatically identify the parts of a paper that describe its novelties or outline future work, we can easily spot its contributions and then compare them with other pieces of work. The tutorial provides an overview of the different approaches and annotation schemas developed to model scientific discourse, together with the related collections of scientific papers that have been manually annotated with respect to such schemas. Several automated approaches that learn to characterize the scientific discourse inside a publication are also reviewed and compared;
  • the use of citations to improve access to scientific literature: citations represent a core device of scientific communication since they provide explicit and motivated links among articles. The tutorial provides several examples of how the structure of the citation network of papers can be analyzed in order to identify communities of interest and experts, as well as methods to track the temporal evolution of topics in scientific literature. Not all citations are equal: a paper can be cited as a fundamental source of information, to criticize it, or simply to refer to a specific method used. Many schemas and approaches to characterize the purpose and polarity of citations have been proposed, and they are reviewed and compared in the tutorial;
  • the generation of summaries of scientific publications: the availability of a short text that recaps the core contents and contributions of a paper is essential in order to rapidly and effectively go through big collections of articles and select the ones to inspect in more detail. After an overview of general-purpose approaches to the creation of document summaries and their evaluation, the tutorial reviews the relevant methodologies that have been proposed to summarize scientific publications: the citations received and the structure of scientific discourse constitute useful traits for identifying the most relevant contents of articles.
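
As a rough illustration of the citation-network analysis mentioned above, the following minimal Python sketch ranks papers by the number of citations they receive. It is not taken from the tutorial: the paper identifiers, the toy `citations` list and the `rank_by_citations` helper are all hypothetical, and a plain citation count is only the simplest possible stand-in for the richer network analyses the tutorial reviews.

```python
from collections import Counter

# Hypothetical toy data: (citing_paper, cited_paper) pairs.
citations = [
    ("P2", "P1"),
    ("P3", "P1"),
    ("P3", "P2"),
    ("P4", "P1"),
    ("P4", "P3"),
]


def rank_by_citations(edges):
    """Return (paper, citation_count) pairs, most cited first."""
    # Count how many times each paper appears as the target of a citation.
    counts = Counter(cited for _citing, cited in edges)
    # Also list papers that cite others but are never cited themselves.
    for citing, _cited in edges:
        counts.setdefault(citing, 0)
    # Sort by descending count, then by paper id for a deterministic order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_citations(citations)
for paper, count in ranking:
    print(paper, count)
```

In this toy network, the most cited paper comes first in `ranking`; a real analysis would work on a much larger graph and would typically also take into account the purpose and polarity of each citation.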

The main scientific information extraction challenges proposed over the last few years are also reviewed in the tutorial, together with the most relevant datasets for experimenting with new scientific text mining approaches. Some examples of big data architectures to crawl and process huge volumes of scientific publications are discussed. The Dr. Inventor Text Mining Framework, a Java-based library that integrates a varied set of scientific text mining tools, is presented through an overview of its architecture and a practical demo of how a scientific publication is processed and what kind of information is extracted. The tutorial was attended by more than thirty people, providing a chance to discuss current approaches and new ideas in more detail.


Relevant links: