The Impact of Variable Selection and Transformation on the Interpretability and Accuracy of Fuzzy Models

Nobile, Marco S.
2022

Abstract

Data transformation is an important step in machine learning pipelines and can substantially improve their performance. For instance, min-max normalization is often used to make all variables lie in the same range, while log-transformation maps data scattered across several orders of magnitude to a logarithmic space. Such transformations can be beneficial when the machine learning approach measures distances in a metric space, as clustering-based approaches do. The two transformations can be combined to reveal hidden patterns in log-normally distributed data, which commonly occur in biology and medicine. In this work we introduce a novel evolutionary approach designed to automatically determine the optimal log-transformation and selection of variables. Our approach is built around an interpretable AI system (created by pyFUME), so that all transformations are followed by inverse transformations that map the values back into the original universe of discourse and preserve the interpretability of the results. We test our approach on two synthetic datasets, designed to reproduce a condition in which some variables are normally distributed, some are log-normally distributed, and some are just noise. Our results show that our approach yields better-performing models than conventional methods and that the resulting model is also more interpretable, making the approach particularly useful for studying biomedical datasets.
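The combination of log-transformation, min-max normalization and inverse transformation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the pyFUME API; the function names `forward` and `inverse`, the `use_log` flag and the synthetic data are assumptions made purely for illustration, and the evolutionary search that chooses which variables to transform is not shown.

```python
import math
import random

def forward(values, use_log):
    """Optionally log-transform the values, then min-max scale them to [0, 1]."""
    z = [math.log(v) for v in values] if use_log else [float(v) for v in values]
    lo, hi = min(z), max(z)
    scaled = [(v - lo) / (hi - lo) for v in z]
    return scaled, (lo, hi)  # keep the bounds so the mapping can be inverted

def inverse(scaled, bounds, use_log):
    """Undo the min-max scaling, then undo the log-transform if it was applied."""
    lo, hi = bounds
    z = [v * (hi - lo) + lo for v in scaled]
    return [math.exp(v) for v in z] if use_log else z

# Hypothetical log-normally distributed variable spanning several orders of magnitude.
random.seed(0)
data = [random.lognormvariate(0.0, 2.0) for _ in range(1000)]

scaled, bounds = forward(data, use_log=True)
restored = inverse(scaled, bounds, use_log=True)
```

Because `inverse` reverses both steps, quantities computed in the transformed space (cluster centers, for example) can be reported in the variable's original units, which is the property the abstract relies on to preserve interpretability.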
2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
Files in this record:
File: The_Impact_of_Variable_Selection_and_Transformation_on_the_Interpretability_and_Accuracy_of_Fuzzy_Models.pdf (not available)
Type: Post-print document
License: Closed access (personal)
Size: 2.75 MB
Format: Adobe PDF

Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/10278/5003620