Collective Outlier Detection and Enumeration with Conformalized Closed Testing

This paper develops a distribution-free method for collective outlier detection and enumeration, designed for situations in which the precise identification of individual outliers may be impractical due to the sparsity or weakness of their signals. The method builds upon recent developments in conformal inference and blends them with more classical ideas from other areas, including multiple testing, rank tests, permutations, and non-parametric large-sample asymptotics. Key innovations include an extension of the Wilcoxon-Mann-Whitney test, which may be of independent interest, and a principled algorithm for tuning the choice of machine learning classifier and two-sample testing procedure used by the method, yielding an adaptive strategy. Given a control sample in which all observations are drawn independently from the same distribution (the inlier distribution) and a test sample in which some observations may be drawn from a different distribution (the outlier distribution), the methodology implements the closed testing procedure to provide simultaneous inference on the number of outliers in the test sample or in any subset of it. The inferential output is a (1−α)-confidence lower bound for the number of true outliers, valid after any selection of the data in the test set. Further, we theoretically motivate the choice of the extended Wilcoxon-Mann-Whitney tests as local tests in the closed testing procedure, studying their optimality and deriving results under distribution-free alternatives. How local optimality transfers to the closed testing procedure is left for future research. The effectiveness of our method is demonstrated through extensive empirical experiments, including an analysis of the LHCO high-energy particle collision data set.
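The idea of combining a rank-based two-sample test with closed testing can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses the classical (not the extended) Wilcoxon-Mann-Whitney statistic with a normal approximation, and it enumerates all subsets by brute force, which is only feasible for a handful of test points. The function names `wmw_pvalue` and `outlier_lower_bound`, the default `alpha`, and the assumption that larger scores indicate more outlying points are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def wmw_pvalue(calib, test):
    """One-sided Wilcoxon-Mann-Whitney p-value via the normal approximation.

    Small values suggest that test scores tend to exceed calibration scores.
    """
    n, m = len(calib), len(test)
    # U counts (calibration, test) pairs where the test score is larger;
    # ties contribute one half.
    u = sum((y > x) + 0.5 * (y == x) for x in calib for y in test)
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability


def outlier_lower_bound(calib, test, selected, alpha=0.1):
    """Brute-force closed testing (illustrative only, exponential in len(test)).

    Returns a lower bound on the number of outliers among the test indices
    in `selected`, using WMW local tests at level `alpha`.
    """
    indices = range(len(test))
    # Local test of H_I: "every test point in I is an inlier".
    local_reject = {}
    for k in range(1, len(test) + 1):
        for subset in combinations(indices, k):
            p = wmw_pvalue(calib, [test[i] for i in subset])
            local_reject[frozenset(subset)] = p <= alpha

    def closed_reject(subset):
        # H_I is rejected only if every superset J of I is locally rejected.
        return all(rej for sup, rej in local_reject.items() if subset <= sup)

    # Size of the largest subset of `selected` that cannot be rejected.
    largest = 0
    for k in range(len(selected), 0, -1):
        if any(not closed_reject(frozenset(s)) for s in combinations(selected, k)):
            largest = k
            break
    return len(selected) - largest


if __name__ == "__main__":
    calib_scores = [float(i) for i in range(1, 21)]          # control sample
    test_scores = [100.0, 101.0, 102.0, 103.0, 3.5, 10.5]    # four clear outliers
    print(outlier_lower_bound(calib_scores, test_scores, list(range(6))))
```

Because the bound comes from closed testing, the same call can be repeated for any `selected` subset chosen after looking at the data. In this toy example the bound is conservative: it is smaller than the true number of outliers.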

Chiara Gaia Magnani;Matteo Sesia;Aldo Solari

2024-01-01

Publication year: 2024
Volume title: Volume 230: The 13th Symposium on Conformal and Probabilistic Prediction with Applications, 9-11 September 2024, Politecnico di Milano, Milano, Italy
Appears in categories: 3.1 Article in book

Files in this record:

magnani24a.pdf (open access; publisher's version; Creative Commons license; 108.77 kB; Adobe PDF)

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5100331
