Collective Outlier Detection and Enumeration with Conformalized Closed Testing
Aldo Solari
2024-01-01
Abstract
This paper develops a distribution-free method for collective outlier detection and enumeration, designed for situations in which the precise identification of individual outliers may be impractical due to the sparsity or weakness of their signals. The method builds upon the latest developments in conformal inference and blends them with more classical ideas from other areas, including multiple testing, rank tests, permutations, and non-parametric large-sample asymptotics. Key innovations include an extension of the Wilcoxon-Mann-Whitney test, which may be of some independent interest, and a principled algorithm for tuning the choices of machine learning classifier and two-sample testing procedure used by our method, yielding an adaptive strategy. Given a control sample in which all observations are drawn independently from the same distribution (the inlier distribution) and a test sample in which some observations may be drawn from a different distribution (the outlier distribution), our methodology implements a closed testing procedure that provides simultaneous inference on the number of outliers in the test sample or in any subset of it. The inferential result produced by our method is a (1−α)-confidence lower bound for the number of true outliers, valid after any selection of the data in the test set. Further, we theoretically motivate the choice of the extended Wilcoxon-Mann-Whitney tests as local tests in the closed testing procedure, studying their optimality and deriving results under distribution-free alternatives. How local optimality transfers to the closed testing procedure is left for future research. The effectiveness of our method is demonstrated through extensive empirical studies, including an analysis of the LHCO high-energy particle collision data set.
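The following is a minimal sketch of how closed testing with rank-based local tests can yield a lower confidence bound for the number of outliers. It is an illustration under stated assumptions, not the paper's implementation. It uses the classical Wilcoxon-Mann-Whitney test with its exact null distribution rather than the extended test proposed in the paper. It assumes that outlier scores have already been computed by some classifier, with larger values indicating more outlying points, and that there are no ties. It enumerates all subsets by brute force, which is feasible only for very small test sets. The function names `count_u`, `wmw_pvalue`, and `outlier_lower_bound` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations
from math import comb


@lru_cache(maxsize=None)
def count_u(k, n, u):
    """Count orderings of k test and n calibration scores whose
    Mann-Whitney statistic (number of pairs test > calibration) equals u."""
    if u < 0:
        return 0
    if k == 0 or n == 0:
        return 1 if u == 0 else 0
    # The largest score is either a test score (it beats all n calibration
    # scores) or a calibration score (it adds nothing to the statistic).
    return count_u(k - 1, n, u - n) + count_u(k, n - 1, u)


def wmw_pvalue(test_scores, calib_scores):
    """Exact one-sided p-value: large test scores are evidence of outliers.
    Assumes no ties and exchangeability under the null."""
    k, n = len(test_scores), len(calib_scores)
    u_obs = sum(t > c for t in test_scores for c in calib_scores)
    tail = sum(count_u(k, n, u) for u in range(u_obs, k * n + 1))
    return tail / comb(k + n, k)


def outlier_lower_bound(calib_scores, test_scores, selected, alpha=0.1):
    """(1 - alpha)-confidence lower bound on the number of outliers among
    the selected test indices, via brute-force closed testing."""
    m = len(test_scores)
    selected = set(selected)
    bound = len(selected)  # the empty hypothesis is never rejected
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            p = wmw_pvalue([test_scores[i] for i in subset], calib_scores)
            if p > alpha:
                # "All points in subset are inliers" is not rejected, so
                # only selected points outside it can be claimed as outliers.
                bound = min(bound, len(selected - set(subset)))
    return bound


# Illustrative data (made up): 19 calibration scores, 3 test scores.
calib = [float(i) for i in range(1, 20)]
test = [100.0, 101.0, 102.0]
print(outlier_lower_bound(calib, test, selected=range(3), alpha=0.1))
```

Because the bound is computed from the local tests of all subsets, it remains valid for any `selected` set, including one chosen after looking at the data. This is the sense in which closed testing provides simultaneous inference.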

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| magnani24a.pdf | Open access | Publisher's version | Creative Commons | 108.77 kB | Adobe PDF |