Intelligent enrichment of specialized corpora to generate disease- and cell-specific interactomes in computational systems biology


Several tools for network-based modelling in computational systems biology, such as STRING, MATRIX DB and MyProteinNet, allow the construction of generic interactomes. The construction process can usually be contextualised through specific databases and/or terms. Nevertheless, these tools remain too generic and inefficient for specifically targeted cell types and disorders, such as cartilage cells and cartilaginous tissue degeneration. In such cases, generic interactomes need to be further optimized against molecular biology measurements. In general, specific interactomes can be defined via manual literature retrieval and analysis. However, such a process leads to very small corpora and incomplete information, which limits the capacity to achieve robust nodal/topological optimizations. Hence, the objective of the proposed work is to enhance specialized corpora in order to improve their exploitability in disease-specific network modelling, by selectively enriching their information from larger and more generic semantically annotated corpora such as GENIA. The work will focus on intervertebral disc degeneration (protein and RNA classes), for which a specific corpus is already available at UPF. It will be based on a neural network architecture and will potentially involve the development and/or exploitation of transfer learning techniques. Extending the enrichment of the specialised corpus to alternative sources such as PubMed will also be considered. The project will be co-supervised by Jérôme Noailly from BCN MedTech (biological and disorder-specific significance, systems biology target) and Leo Wanner from the Natural Language Processing Research Group (natural language analysis, machine learning).


Hosting group: SIMBIOSys-MBIOMM


Supervisors: Jérôme Noailly (BCN MedTech); Leo Wanner (Natural Language Processing Research Group)