Knowledge Extraction and Modeling from Scientific Publications


Ronzano F, Saggion H. Knowledge Extraction and Modeling from Scientific Publications. Enhancing Scholarly Data Workshop – SAVE-SD 2016


During the last decade, the number of scientific articles available online has grown substantially, in parallel with the adoption of the Open Access publishing model. Researchers, and any other interested actors, are often overwhelmed by the enormous and continuously growing number of publications they must consider in order to perform a complete and careful assessment of the scientific literature. As a consequence, new methodologies and automated tools are needed to ease the extraction, semantic representation and browsing of information from papers. We propose a platform that automatically extracts, enriches and characterizes several structural and semantic aspects of scientific publications, representing them as RDF datasets. We analyze papers by relying on the scientific Text Mining Framework developed in the context of the European Project Dr. Inventor. We evaluate how the Framework supports two core scientific text analysis tasks: rhetorical sentence classification and extractive text summarization. To ease the exploration of the distinct facets of scientific knowledge extracted by our platform, we present a set of tailored Web visualizations. We provide online access to both the RDF datasets and the Web visualizations generated by mining the papers of the 2015 ACL-IJCNLP Conference.


Keywords: scientific knowledge extraction, knowledge modeling, RDF, software framework
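
As an illustration of what representing a paper as an RDF dataset can look like, the following minimal Python sketch serializes sentences and their rhetorical categories as N-Triples. It is not the platform's actual data model: the namespace http://example.org/scipub/, the property names (partOf, text, rhetoricalCategory) and the category labels used in the example are all assumptions made for this sketch.

```python
from urllib.parse import quote

# Hypothetical namespace and vocabulary, chosen only for this sketch.
EX = "http://example.org/scipub/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape the characters N-Triples does not allow raw inside a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def sentence_triples(paper_id, index, text, category):
    """Return the N-Triples lines describing one sentence of a paper."""
    pid = quote(paper_id, safe="")
    paper = f"<{EX}paper/{pid}>"
    sentence = f"<{EX}paper/{pid}/sentence/{index}>"
    cat = f"<{EX}category/{quote(category, safe='')}>"
    return [
        f"{sentence} <{RDF_TYPE}> <{EX}Sentence> .",
        f"{sentence} <{EX}partOf> {paper} .",
        f'{sentence} <{EX}text> "{escape_literal(text)}" .',
        f"{sentence} <{EX}rhetoricalCategory> {cat} .",
    ]


def paper_to_ntriples(paper_id, sentences):
    """Serialize a paper given as a list of (sentence text, category) pairs."""
    pid = quote(paper_id, safe="")
    lines = [f"<{EX}paper/{pid}> <{RDF_TYPE}> <{EX}Paper> ."]
    for index, (text, category) in enumerate(sentences, start=1):
        lines.extend(sentence_triples(paper_id, index, text, category))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative input; the category labels are placeholders.
    demo = [
        ("We propose a platform to mine papers.", "Approach"),
        ("Future work includes larger corpora.", "FutureWork"),
    ]
    print(paper_to_ntriples("demo-paper", demo), end="")
```

Each sentence gets its own URI so that further annotations can be attached to it later without changing the existing triples. A dedicated RDF library would be the more robust choice in practice.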


Additional material:

Ronzano, F., & Saggion, H. (2015). Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. Discovery Science (pp. 209-220). Springer International Publishing.

Fisas, B., Ronzano, F., & Saggion, H. (2015). A Multi-Layered Annotated Corpus of Scientific Papers. To appear in the LREC Conference 2016.

This corpus includes 40 Computer Graphics papers containing 8,877 sentences that have been manually annotated with their scientific discourse rhetorical category. In addition, the corpus includes three handwritten summaries of at most 250 words for each paper.

Saggion, H. (2008). SUMMA: A robust and adaptable summarization tool. Traitement Automatique des Langues, 49(2).