Collective Outlier Detection and Enumeration with Conformalized Closed Testing

Aldo Solari
2024-01-01

Abstract

This paper develops a distribution-free method for collective outlier detection and enumeration, designed for situations in which the precise identification of individual outliers may be impractical because their signals are sparse or weak. The method builds upon recent developments in conformal inference and blends them with more classical ideas from other areas, including multiple testing, rank tests, permutations, and non-parametric large-sample asymptotics. Key innovations include an extension of the Wilcoxon-Mann-Whitney test, which may be of independent interest, and a principled algorithm for tuning the choice of machine learning classifier and two-sample testing procedure used by our method, yielding an adaptive strategy. Given a control sample in which all observations are drawn independently from the same distribution (the inlier distribution) and a test sample in which some observations may be drawn from a different distribution (the outlier distribution), our methodology implements the closed testing procedure to provide simultaneous inference on the number of outliers in the test sample or in any subset of it. The inferential output of our method is a (1−α)-confidence lower bound for the number of true outliers, valid after any selection of the data in the test set. Further, we theoretically motivate the choice of the extended Wilcoxon-Mann-Whitney tests as local tests in the closed testing procedure, studying their optimality and deriving findings under distribution-free alternatives. How local optimality transfers to the closed testing procedure is left for future research. The effectiveness of our method is demonstrated through extensive empirical studies, including an analysis of the LHCO high-energy particle collision data set.
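
To make the setting concrete, the following minimal sketch shows how a rank test can compare a control sample with a test sample. It uses the classical Wilcoxon-Mann-Whitney statistic with a permutation p-value for the global null hypothesis that the test sample contains no outliers. It is not the extended test or the closed testing procedure of the paper. The function names, the simulated scores, and the convention that larger scores indicate outliers are all assumptions made for illustration; in practice the scores would come from a classifier fitted on separate data.

```python
import numpy as np


def wmw_statistic(cal_scores, test_scores):
    """Mann-Whitney U: number of (control, test) pairs in which the
    test score is larger; ties count one half."""
    cal = np.asarray(cal_scores, dtype=float)[:, None]
    test = np.asarray(test_scores, dtype=float)[None, :]
    return float(np.sum(test > cal) + 0.5 * np.sum(test == cal))


def permutation_pvalue(cal_scores, test_scores, n_perm=2000, seed=0):
    """One-sided permutation p-value for the global null of no outliers.

    Assumes the pooled scores are exchangeable under the null and that
    larger scores point towards outliers (illustrative convention).
    """
    rng = np.random.default_rng(seed)
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    pooled = np.concatenate([cal, test])
    n_cal = cal.size
    observed = wmw_statistic(cal, test)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # Recompute the statistic after randomly relabelling the pooled scores.
        if wmw_statistic(perm[:n_cal], perm[n_cal:]) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling itself, keeping the p-value valid.
    return (1 + exceed) / (1 + n_perm)


def simulated_example(seed=1):
    """Hypothetical data: 200 control scores, 50 test scores, 10 shifted outliers."""
    rng = np.random.default_rng(seed)
    cal = rng.normal(size=200)
    test = np.concatenate([rng.normal(size=40), rng.normal(loc=1.0, size=10)])
    return permutation_pvalue(cal, test)
```

A small p-value from `permutation_pvalue` is evidence that at least one outlier is present in the test sample. Turning such tests into a lower bound on the number of outliers in arbitrary subsets requires the closed testing step described in the abstract, which this sketch does not attempt.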
2024
Volume 230: The 13th Symposium on Conformal and Probabilistic Prediction with Applications, 9-11 September 2024, Politecnico di Milano, Milano, Italy

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5100331