Investigating Fairness with FanFAIR: is Pre-Processing Useful Only for Performances?
Nobile, Marco S.; Confalonieri, Marco; Gallese, Chiara
2025-01-01
Abstract
Artificial Intelligence, and Machine Learning (ML) systems in general, are becoming pervasive in our society, from industry to public administration. AI can often provide a very efficient means to support decision-making, but it can represent a danger in high-risk applications such as biomedicine and healthcare. In particular, biased datasets might lead to inaccurate or discriminatory ML systems, undermining the accuracy of their predictions and putting patients' health at risk. FanFAIR is a Python tool that provides the community with semi-automatic dataset fairness assessment. FanFAIR is designed to integrate qualitative considerations - such as ethics, human rights assessment, and data protection - with quantitative indicators of a dataset's fairness, such as balance, the presence of invalid entries, or outliers. In this work, we extend FanFAIR to deal with categorical data and introduce a new algorithm for outlier detection in the presence of missing values. We then provide a case study on data collected from COVID patients admitted to pneumology departments in Italy. We show how successive steps of data cleaning and variable selection improve the indicators provided by FanFAIR. This shows that data cleaning procedures are not only necessary to improve the performance of the machine learning algorithm that learns from the data, but are also a way to improve (a measure of) fairness. Hence, the proposed case study provides an example in which performance and fairness are not in conflict, as is commonly believed, but improve together.
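To make the quantitative indicators mentioned in the abstract concrete, the minimal sketch below illustrates two of them: a balance score for a (possibly categorical) target column, and outlier detection that tolerates missing values. The function names (`balance_score`, `outlier_mask`) and the entropy- and MAD-based formulations are illustrative assumptions, not FanFAIR's actual API or the paper's new algorithm.

```python
import numpy as np

def balance_score(labels):
    """Normalized Shannon entropy of the class distribution.

    Returns 1.0 for a perfectly balanced dataset and tends to 0.0 as
    one class dominates. (Illustrative; not FanFAIR's exact metric.)
    """
    _, counts = np.unique(labels, return_counts=True)
    if len(counts) < 2:
        return 0.0
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

def outlier_mask(column, k=3.0):
    """Flag outliers in a numeric column that may contain NaNs.

    Uses the median and the median absolute deviation (MAD), which
    ignore missing values and are robust to extreme entries; NaN
    entries are never flagged. (A sketch of one common robust
    approach, not the algorithm introduced in the paper.)
    """
    col = np.asarray(column, dtype=float)
    med = np.nanmedian(col)
    mad = np.nanmedian(np.abs(col - med))
    if mad == 0:
        return np.zeros(col.shape, dtype=bool)
    # 1.4826 rescales the MAD to be consistent with a standard deviation
    return np.abs(col - med) > k * 1.4826 * mad

# Example: a skewed binary outcome, a missing value, and an invalid entry
labels = ["recovered"] * 90 + ["deceased"] * 10
ages = [54, 61, 58, np.nan, 57, 450, 63]  # 450 is an implausible age
print(f"balance  = {balance_score(labels):.3f}")  # low: imbalanced classes
print(f"outliers = {outlier_mask(ages)}")         # only 450 is flagged
```

In this hypothetical setup, a low balance score and any flagged outliers would lower the corresponding fairness indicators, which is what the case study's data-cleaning steps are reported to improve.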
| File | Size | Format | |
|---|---|---|---|
| Investigating_Fairness_with_FanFAIR_is_Pre-Processing_Useful_Only_for_Performances.pdf | 466.82 kB | Adobe PDF | View/Open |

Type: publisher's version. License: publisher's copyright. Access: not available.
Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.