Data Science

Study plan

Presentation

To prepare students for a career in biomedical engineering, it is essential that they master several techniques from data science. These techniques underpin much of modern biomedical engineering, and most projects (academic as well as commercial) involve at least one of them.

With this in mind, the Data Science course introduces fundamental concepts and mathematical tools of data science. The aim is to give students an introduction to the most common data science techniques, both at a theoretical and at a practical level, preparing them for a multitude of data analysis tasks in biomedical engineering.

The course is divided into four topics, which together cover most data science techniques usually applied in computational biomedical engineering: 1) hypothesis testing; 2) unsupervised learning; 3) supervised learning; and 4) deep learning. Each topic consists of a theory session which explains key concepts, and a lab session in which students obtain hands-on experience with data science algorithms.

Prerequisites

The course assumes familiarity with Matlab and Python, as well as basic knowledge of statistics. Some background in linear algebra and calculus is also required.

Theory

1: Hypothesis testing

The first part of the class introduces basic concepts of statistics (e.g. mean, standard deviation and variance), together with the assumption that biological data are normally distributed. The null hypothesis and "acceptable errors" are also defined.
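As an illustrative sketch (not part of the official course material), the basic statistics mentioned above can be computed with NumPy. The measurements are made-up values chosen only for the example.

```python
import numpy as np

# Hypothetical measurements (e.g. resting heart rate in beats per minute).
samples = np.array([62.0, 71.0, 68.0, 75.0, 64.0, 70.0])

mean = samples.mean()
# ddof=1 divides by n - 1, giving the unbiased sample variance.
variance = samples.var(ddof=1)
# The standard deviation is the square root of the variance.
std_dev = samples.std(ddof=1)

print(f"mean={mean:.2f}, variance={variance:.2f}, std={std_dev:.2f}")
```

Note that NumPy defaults to ddof=0 (the population variance), so the argument has to be set explicitly when working with a sample.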

The class then defines a statistical model that helps students describe a physical or biological event. Based on this model, one-factor analysis of variance (ANOVA) is introduced. The last part of the class extends ANOVA to two or more factors and presents the concept of interaction. The class also covers how to handle repeated tests and the associated propagation of error.
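A minimal sketch of one-factor ANOVA, assuming independent groups: the F statistic is the ratio of the between-group mean square to the within-group mean square. The function name and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the F statistic and its degrees of freedom for one-factor ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of factor levels
    n = all_values.size    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data: three levels of a single factor, three observations each.
f_stat, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_between, df_within)  # F = 13.0 with (2, 6) degrees of freedom
```

In practice the same statistic, together with its p-value, can be obtained from scipy.stats.f_oneway.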

2: Unsupervised learning

This part first introduces some basic concepts of machine learning (e.g. the learning model, the different types of learning, feature representation, etc.) and then focuses on how unsupervised learning can be used to find structure in unlabeled data, to build a model or to find a useful representation of the data. Applications of unsupervised learning in biomedical engineering are also discussed. Topics covered include methods for dimensionality reduction (e.g. principal component analysis, linear and non-linear manifold learning) and clustering (e.g. k-means, hierarchical clustering). The class also discusses how this type of learning relates to semi-supervised and supervised learning.
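The following sketch illustrates two of the methods listed above, principal component analysis (computed here via the singular value decomposition) and k-means clustering. The function names, the synthetic two-blob data and the parameter values are assumptions made for the example, not course material.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    return X_centered @ Vt[:n_components].T, explained_variance[:n_components]

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: alternate assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        # Keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(X, centroids), centroids

# Synthetic, unlabeled data: two well-separated blobs in five dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 5)),
               rng.normal(5.0, 0.5, size=(50, 5))])

projected, explained = pca(X, n_components=2)   # dimensionality reduction
labels, centroids = kmeans(projected, k=2)      # clustering in the reduced space
print(projected.shape, np.bincount(labels))
```

Clustering in the reduced space is a design choice for the example; k-means could equally be applied to the original five-dimensional data.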

3: Supervised learning

The class introduces basic components of supervised learning: the dataset of labelled examples, the hypothesis set, and the learning algorithm. The class then describes the simplest form of learning algorithms, namely linear models (the perceptron learning algorithm, linear regression and logistic regression). More advanced learning algorithms are also introduced, such as decision trees and support vector machines.
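As a hedged illustration of the simplest algorithm named above, the sketch below implements the perceptron learning algorithm on a small, linearly separable toy dataset. The data, the function names and the max_epochs parameter are assumptions introduced for the example.

```python
import numpy as np

def add_bias(X):
    """Prepend a constant 1 to every example so the bias is part of w."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def train_perceptron(X, y, max_epochs=100):
    """Perceptron learning algorithm; labels in y must be +1 or -1."""
    X_aug = add_bias(X)
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        n_updates = 0
        for x_i, y_i in zip(X_aug, y):
            # A point is misclassified if it lies on the wrong side
            # of the current hyperplane (or exactly on it).
            if y_i * (w @ x_i) <= 0:
                w += y_i * x_i
                n_updates += 1
        if n_updates == 0:   # every example is classified correctly
            break
    return w

def predict(w, X):
    return np.where(add_bias(X) @ w > 0, 1, -1)

# Labelled toy examples (linearly separable by construction).
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = train_perceptron(X, y)
print(w, predict(w, X))
```

The algorithm is only guaranteed to stop updating when the data are linearly separable, which is why the sketch caps the number of epochs.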

Moreover, the class discusses several important concepts related to supervised learning. The aim of learning is to generalize to unseen examples, but overly complex models tend to overfit the training data. To apply supervised learning successfully it is therefore essential to manage overfitting, and several techniques for doing so are discussed, such as regularization and validation. The class also shows how non-linear transformations can be used to adapt linear models to datasets that are not inherently linear.
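The three ideas above can be combined in one small sketch: a non-linear (polynomial) transformation feeds a linear model, ridge regularization limits overfitting, and a held-out validation set is used to choose the regularization strength. The synthetic data, the polynomial degree and the candidate values of lam are assumptions made for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Non-linear transform: map scalar inputs to [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_ridge(Phi, y, lam):
    """Closed-form ridge regression: w = (Phi^T Phi + lam * I)^-1 Phi^T y.

    For simplicity the bias weight is regularized too.
    """
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

def mse(Phi, y, w):
    return float(np.mean((Phi @ w - y) ** 2))

# Synthetic noisy data from a non-linear target function.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=40)

# Hold out half of the examples for validation.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]
Phi_train = polynomial_features(x_train, degree=10)
Phi_val = polynomial_features(x_val, degree=10)

results = {}
for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    w = fit_ridge(Phi_train, y_train, lam)
    results[lam] = (mse(Phi_train, y_train, w), mse(Phi_val, y_val, w))
    print(f"lam={lam:g}: train MSE={results[lam][0]:.4f}, "
          f"validation MSE={results[lam][1]:.4f}")

# Model selection uses the validation error, never the training error.
best_lam = min(results, key=lambda lam: results[lam][1])
print("selected lam:", best_lam)
```

The training error can only grow as lam increases, so it cannot be used to select lam; the validation error plays that role.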

4: Deep learning

The class first introduces the building blocks of deep learning: the artificial neuron and the feedforward neural network. Several types of activation functions are described. The backpropagation algorithm for training the weights of feedforward neural networks is explained. The class also describes several alternatives to fully connected networks, such as convolutional networks and residual networks.
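A minimal sketch of these building blocks, assuming a network with one tanh hidden layer, a sigmoid output neuron and a cross-entropy loss, trained with backpropagation and plain gradient descent on the XOR problem. The layer size, learning rate and number of iterations are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a classic dataset that no single linear neuron can fit.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
n_hidden = 4
W1 = rng.normal(0.0, 1.0, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 1.0, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Forward pass: hidden activations and output probabilities."""
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    return h, p

def cross_entropy(p, y):
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

learning_rate = 0.5
initial_loss = cross_entropy(forward(X)[1], y)

for _ in range(5000):
    h, p = forward(X)
    # Backward pass: propagate the error from the output to the input layer.
    delta_out = (p - y) / X.shape[0]                   # sigmoid + cross-entropy
    grad_W2 = h.T @ delta_out
    grad_b2 = delta_out.sum(axis=0)
    delta_hidden = (delta_out @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
    grad_W1 = X.T @ delta_hidden
    grad_b1 = delta_hidden.sum(axis=0)
    # Gradient descent update of all weights and biases.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

final_loss = cross_entropy(forward(X)[1], y)
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Deep learning libraries compute these gradients automatically; writing them out once by hand makes clear what backpropagation does.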

The class then introduces several versions of recurrent neural networks, such as Elman/Jordan networks and LSTMs, as well as Restricted Boltzmann Machines. Algorithms for training recurrent neural networks are described. The class discusses how to solve problems using deep learning and how to overcome the problem of overfitting (e.g. using dropout regularization).
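To make the idea of recurrence concrete, the sketch below runs the forward pass of a simple (Elman-style) recurrent layer, in which each hidden state depends on the current input and on the previous hidden state. The weight names, the layer sizes and the random input sequence are assumptions for the example, and training is not shown.

```python
import numpy as np

def elman_forward(inputs, W_xh, W_hh, b_h, h0=None):
    """Run a simple recurrent layer over a sequence.

    inputs: array of shape (T, input_size)
    returns: hidden states of shape (T, hidden_size)
    """
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0
    states = []
    for x_t in inputs:
        # The same weights are reused at every time step.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5
W_xh = rng.normal(0.0, 0.5, size=(hidden_size, input_size))
W_hh = rng.normal(0.0, 0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(seq_len, input_size))
hidden_states = elman_forward(sequence, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (5, 4): one hidden state per time step
```

An LSTM replaces the single tanh update with gated updates of a separate cell state, which helps when dependencies span many time steps.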

In the last part, the class provides a brief introduction to reinforcement learning and discusses how it is related to deep learning.

Laboratory sessions

In addition to theory classes, each of the four course topics comprises one or more lab sessions. In each lab session, students will be presented with one or several practical tasks related to the data science topic in question. To solve each task, students have to implement code in Matlab and/or Python, possibly taking advantage of existing machine learning libraries to execute the appropriate algorithm. To complete the lab, students have to upload their source code together with a report describing the work done for the lab session.

Evaluation

The evaluation consists of lab exercises for each module and a final exam. The lab exercises account for 60% of the final grade, while the exam accounts for 40%. Each lab exercise can be carried out individually or in groups of two students (recommended), while the final exam is individual. To pass the course, it is necessary to: 1) pass each part separately (labs and exam) with a minimum grade of 5, and 2) pass each module (hypothesis testing, unsupervised learning, supervised learning, and deep learning) with a minimum grade of 5 in the exam.

Bibliography and information resources

T. Mitchell. Machine Learning. McGraw-Hill, Inc. New York, 1997
C. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006
I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. MIT Press, Cambridge, 2016
Mazziotta, J. C., Toga, A. W., Evans, A., Fox, P., & Lancaster, J. (1995). A probabilistic atlas of the human brain: Theory and rationale for its development: The International Consortium for Brain Mapping (ICBM). Neuroimage, 2(2), p. 89-101.
Blezek, D. J., & Miller, J. V. (2007). Atlas stratification. Medical image analysis, 11(5), p. 443-457.
Aljabar, P., Heckemann, R. A., Hammers, A., Hajnal, J. V., & Rueckert, D. (2009). Multi-atlas based segmentation of brain images: atlas selection and its effect on accuracy. Neuroimage, 46(3), p. 726-738.
Iglesias, J. E., & Sabuncu, M. R. (2015). Multi-atlas segmentation of biomedical images: a survey. Medical image analysis, 24(1), p. 205-219.
Guo, Y., Zhao, G., & Pietikäinen, M. (2016). Dynamic Facial Expression Recognition With Atlas Construction and Sparse Representation. IEEE Transactions on Image Processing, 25(5), p. 1977-1992.
Booth, J., Roussos, A., Ponniah, A., Dunaway, D., & Zafeiriou, S. (2017). Large Scale 3D Morphable Models. International Journal of Computer Vision, p. 1-22.
Montgomery, Douglas C. Design and analysis of experiments. Wiley, 2012