[PhD thesis] Towards the improvement of decision tree learning: a perspective on search and evaluation
Author: Cecilia Nunes
Supervisors: Óscar Cámara, Anders Jonsson
Data mining and machine learning (ML) are increasingly at the core of many aspects of modern life. With growing concerns about the impact of relying on predictions we cannot understand, there is widespread agreement regarding the need for reliable interpretable models. One of the areas where this is particularly important is clinical decision-making. Specifically, explainable models have the potential to facilitate the elaboration of clinical guidelines and related decision-support tools. The presented research focuses on the improvement of decision tree (DT) learning, one of the most popular interpretable models, motivated by the challenges posed by clinical data.
One of the limitations of interpretable DT algorithms is that they involve decisions based on strict thresholds, which can impair performance in the presence of noisy measurements. In this regard, we proposed a probabilistic method that takes into account a model of the noise in the distinct learning phases. When considering this model during training, the method showed moderate improvements in accuracy compared to the standard approach, but significant reductions in the number of leaves.
Standard DT algorithms follow a locally-optimal approach which, despite providing good performance at a low computational cost, does not guarantee optimal DTs. The second direction of research therefore concerned the development of a non-greedy DT learning approach that employs Monte Carlo tree search (MCTS) to heuristically explore the space of DTs. Experiments revealed that the algorithm improved the trade-off between performance and model complexity compared to locally-optimal learning. Moreover, dataset size and feature interactions played a role in the behavior of the method.
Despite being used for their explainability, DTs are chiefly evaluated based on prediction performance. The need for comparing the structure of DT models arises frequently in practice, and is usually dealt with by manually assessing a small number of models. We attempted to fill this gap by proposing a similarity measure to compare the structure of DTs. An evaluation of the proposed distance on a hierarchical forest of DTs indicated that it was able to capture structural similarity.
Overall, the reported algorithms take a step in the direction of improving the performance of DT algorithms, particularly with regard to model complexity and a more useful evaluation of such models. The analyses help improve the understanding of how several data properties affect DT learning, and illustrate the potential role of DT learning as an asset for clinical research and decision-making.
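The strict-threshold limitation mentioned in the abstract can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes, purely for illustration, that the true value of a feature is Gaussian-distributed around the observed measurement with a known standard deviation `sigma`, and the function names `left_probability` and `soft_predict` are hypothetical. Instead of routing a sample entirely to one branch, the split weights both branches by the probability that the true value falls on each side of the threshold.

```python
import math


def left_probability(x, threshold, sigma):
    """Probability that the true value is <= threshold, given observation x.

    Assumption (not from the thesis): the true value is modelled as
    Gaussian around the observation x with standard deviation sigma.
    """
    if sigma <= 0:
        # No noise: fall back to the usual hard threshold.
        return 1.0 if x <= threshold else 0.0
    z = (threshold - x) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def soft_predict(x, threshold, sigma, p_left, p_right):
    """Blend the class-1 probabilities of the two child leaves.

    p_left and p_right are the (hypothetical) class-1 probabilities
    stored in the left and right leaves of a single split.
    """
    w = left_probability(x, threshold, sigma)
    return w * p_left + (1.0 - w) * p_right


if __name__ == "__main__":
    # A measurement just above the threshold: a hard split would send it
    # fully to the right leaf, the soft split still gives the left leaf weight.
    print(soft_predict(5.1, threshold=5.0, sigma=0.5, p_left=0.2, p_right=0.8))
```

With `sigma = 0` the sketch reduces to an ordinary hard split, so the noise model only changes predictions for measurements that lie close to a threshold relative to the assumed noise level.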
Link to manuscript: http://hdl.handle.net/10803/667879
Thesis carried out in the context of the CardioFunXion Marie Curie Industrial Network coordinated by DTIC-UPF, with the participation of Philips France, and additionally supported by the MdM program