Online tool for PDF-to-XML conversion


You can use our free online tool to parse your PDF files. Our approach is based on PDFdigest, a PDF text extraction system designed specifically to extract the headings and logical structure of scientific articles (title, authors, abstract and so on) as well as their textual content. The result is provided as an XML file. PDFdigest also produces a structured HTML file that mirrors the original PDF.

The tool is available in
You can process a maximum of 10 PDF files at a time. If you have a large number of PDF files, please contact us.

  1. Saggion H, Ronzano F, Accuosto P, Ferrés D. MultiScien: a bi-lingual natural language processing system for mining and enrichment of scientific collections. In: Mayr P, Chandrasekaran MK, Jaidka K, editors. Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017); 2017 Aug 11; Tokyo, Japan. [place unknown]: CEUR Workshop Proceedings; 2017. p. 26-40.

  2. Ronzano F, Saggion H. Dr. Inventor framework: extracting structured information from scientific publications. In: International Conference on Discovery Science; 2015 Oct 4. Springer International Publishing. p. 209-220.

Do you have any requests, questions, comments or suggestions?

Please let us know by sending an email to: horacio.saggion AT