Best summarization system at CL-SciSumm 2018 challenge at BIRNDL by the TALN-UPF team


The UPF team, composed of Ahmed Abura’ed, Alex Bravo and Horacio Saggion, in collaboration with Luis Chiruzzo from Universidad de la República (Uruguay), developed several systems to participate in the challenge, described in “LaSTUS/TALN+INCO @ CL-SciSumm 2018 - Using Regression and Convolutions for Cross-document Semantic Linking and Summarization of Scholarly Literature”. They obtained the best performance in Task 2 on summarization against the abstract and human summaries, and the second-best performance against community summaries. The novelty in the systems proposed by UPF this year (the team also participated in the two previous editions) was the use of convolutions and regression to identify which parts of a reference paper have been cited by a set of citing papers and, finally, to produce a summary of the reference paper based on those citations.
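The core linking step can be pictured as ranking the sentences of the reference paper by their similarity to each citation sentence. The following is a minimal sketch of that idea using simple bag-of-words cosine similarity; the function names and the scoring scheme are illustrative assumptions, not the team's actual convolutional/regression model:

```python
import math
from collections import Counter

def tokenize(text):
    # Crude whitespace tokenizer with punctuation stripping (illustrative only).
    return [t.lower().strip(".,()") for t in text.split() if t.strip(".,()")]

def cosine(tokens_a, tokens_b):
    # Cosine similarity between two bags of words.
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    num = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    den = math.sqrt(sum(v * v for v in ca.values())) * \
          math.sqrt(sum(v * v for v in cb.values()))
    return num / den if den else 0.0

def link_citance(citance, reference_sentences, top_k=1):
    # Return the indices of the reference-paper sentences most similar
    # to the citance, i.e. the spans the citation most likely refers to.
    citance_tokens = tokenize(citance)
    scored = [(cosine(citance_tokens, tokenize(s)), i)
              for i, s in enumerate(reference_sentences)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]
```

A learned system would replace the raw cosine score with a regression over richer features (embeddings, convolutional representations), but the input/output shape of the task is the same: citance in, ranked reference spans out.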

This 2018 task follows up on the successful CLScisumm-17 at the BIRNDL workshop co-located with SIGIR 2017 (Tokyo), CLScisumm-16 co-located with JCDL 2016 (Rutgers, NJ, USA) and the CL Pilot Task conducted as part of the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014). The CL-SciSumm Shared Task is run on the CL-SciSumm corpus and comprises three sub-tasks in automatic research paper summarization over a new corpus of research papers. A training corpus of forty topics has been released, as well as a test corpus of ten topics. Each topic comprises an ACL Computational Linguistics research paper, its citing papers and three output summaries: the traditional self-summary of the paper (the abstract), the community summary (the collection of citation sentences, or ‘citances’) and a human summary written by a trained annotator. Within the corpus, each citance is also mapped to the text it references in the reference paper and tagged with the information facet it represents (annotated dataset and evaluation scripts available at ).

The 2018 challenge had two tasks: Task 1 [A and B], to identify the parts of the reference paper cited by each citing paper and why (the discourse facet they belong to), and Task 2, to generate a summary based on those citations. The organizers provided three types of reference summaries for evaluation: community (based on the citations), abstract (the traditional self-summary of the reference paper) and human (manually written by an annotator).
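For Task 2, a citation-based summary can be built by selecting the highest-scoring citances until a length budget is reached. The sketch below shows that greedy selection step; the word budget and the external scores are assumptions for illustration, not the team's system:

```python
def community_summary(citances, scores, budget=250):
    # Greedily pick the highest-scoring citation sentences while the
    # total word count stays within the budget, then restore the
    # original order for readability.
    order = sorted(range(len(citances)), key=lambda i: scores[i], reverse=True)
    chosen, words = [], 0
    for i in order:
        n = len(citances[i].split())
        if words + n <= budget:
            chosen.append(i)
            words += n
    chosen.sort()
    return " ".join(citances[i] for i in chosen)
```

In practice the scores would come from the linking model of Task 1 (how strongly each citance maps to the reference paper), so the two tasks compose naturally into one pipeline.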