Data Transformation and Its Validity in a Two-Sample Problem: An Illustration Based on Graphical Models

Djordjilovic V.
2025-01-01

Abstract

The rapid expansion of technology in medical and biological domains has led to a surge in available data, accompanied by increased complexity. Classical statistical methods, developed primarily for data assumed to follow a normal distribution, may prove inadequate for handling this complexity. We focus on transcriptomic data, in particular single-cell RNA-seq data, which measure gene expression as counts rather than intensities. Through a simulation study, we demonstrate the limitations of common data transformations intended to make data approximately normal, and highlight the misinterpretations they can cause. The simulation results show that transforming the data can affect the results and their interpretation. In particular, in a two-sample problem within a graphical framework, the set of nodes that differ between the two conditions is affected by the transformation. A possible explanation is that transformations can distort graphical structures that rely on conditional dependencies among variables, which in turn affects the final conclusions. Specifically, when analyzing RNA-seq data, methods tailored to count data are preferable to those designed for normally distributed data. The findings underscore the need for specialized approaches to count data in two-sample problems and call for further research into alternative methods for differential network analysis.
2025
Methodological and Applied Statistics and Demography III. SIS 2024. Italian Statistical Society Series on Advances in Statistics.

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5099992