Reproducibility: Principles, Problems, Practices, and Prospects
Validity is a fundamental aspect of any machine learning approach. While supervised learning now offers a plethora of standard validation methods, far fewer tools exist for unsupervised learning. Moreover, the three families of current validity approaches (external, internal, and relative) all have serious drawbacks and are computationally expensive. We discuss why there are so many proposals for clustering algorithms and why they are detached from approaches to validity. This raises the question of whether, and how, we can validate the results of clustering algorithms. We present a new approach that differs radically from the three established families of validity approaches: it translates the clustering validity problem into an assessment of how easily the resulting supervised learning instances can be learned. We show that this idea satisfies formal principles of cluster quality measures, so the intuition behind our approach rests on a solid theoretical foundation; indeed, it relates to the notion of reproducibility. We contrast our suggestion with prediction strength. Finally, we demonstrate that the principle applies to both crisp and fuzzy clustering algorithms.
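The core idea, assessing cluster validity by how easily the induced supervised learning task can be learned, can be sketched as follows. This is a minimal, hypothetical illustration (not the chapter's actual procedure): cluster the data, treat the cluster labels as class labels, and use leave-one-out 1-nearest-neighbour accuracy as a proxy for the "easiness" of the induced classification problem. All function names and parameters here are our own assumptions for the sketch.

```python
import random
from math import dist  # Euclidean distance between point tuples (Python 3.8+)

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels

def loo_1nn_accuracy(points, labels):
    """Leave-one-out 1-NN accuracy on the induced classification task.
    High accuracy suggests the clustering is easy to learn, hence
    (by the idea above) of high quality."""
    hits = 0
    for i, p in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: dist(p, points[j]))
        hits += labels[i] == labels[nearest]
    return hits / len(points)

# Two well-separated Gaussian blobs: the induced task should be easy.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(30)]
blob_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(30)]
points = blob_a + blob_b
labels = kmeans(points, k=2)
acc = loo_1nn_accuracy(points, labels)
print(f"induced-task accuracy: {acc:.2f}")
```

On well-separated data the induced task is learned almost perfectly, whereas an arbitrary or noisy partition would yield markedly lower accuracy; the chapter develops this intuition formally and connects it to reproducibility and prediction strength.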
Estivill-Castro, V. "Why Are There So Many Clustering Algorithms, and How Valid Are Their Results?" In: Reproducibility: Principles, Problems, Practices, and Prospects, 1st ed., 2015, pp. 169-199.