Bayesian infinite factorization methods with applications to tracking data in football

Schiavon, L.

Factorization models are a mathematical representation of multidimensional data objects as a collection of simpler components. For instance, a matrix can be characterized as a sum of latent rank-one components, where the number of addends is generally much lower than the dimensions of the matrix. Factor models are commonly used across a variety of disciplines to handle data sets in which a large number of observed variables are thought to reflect a smaller number of latent variables. However, it can be challenging to infer both the number of components and their relative impact. To address this issue, it has become popular to rely on overfitted factorization models that avoid strict constraints on either the number of factors or the ordering of the data. In the Bayesian framework, increasing shrinkage priors on latent elements have been proposed. These priors allow the introduction of infinitely many factors whose impact decreases with the component index, so that the unnecessary ones are adaptively removed by shrinking their coefficients increasingly close to zero as the index grows. These flexible approaches are usually named infinite factorization models. This thesis aims to provide an overview of infinite factorization models, presenting the state of the art, discussing the limitations of the current models, and gradually composing a general Bayesian infinite factorization framework that includes novel methods to address such deficiencies. In particular, we consider the role of sparsity in the latent low-rank elements, which is crucial to improving inference and facilitating interpretation. First, we focus on the effect of the sparsity induced by the usual approximation of the infinite model through a truncated version, which is adopted to facilitate posterior inference. In this regard, it is fundamental to carefully assess how the truncation criterion affects the inference performance and the factor model representation.
We propose a novel truncation criterion that relates the level of truncation to the factor contribution to the global data variability, allowing one to easily calibrate the algorithm's parameters. Secondly, we carefully investigate the role of local sparsity within the low-rank latent elements by introducing a new general class of infinite factorization models. In this framework, we provide theoretical support to verify desirable shrinkage properties of the prior, including robustness to large signals and the behaviour of sparsity as the number of factors or the dimension of the data increases. The main novelty of the proposed class of models lies in the dependence of the local sparsity pattern of the latent elements on auxiliary information, which is assumed to inform on the similarity among variables, that is, among the columns of the data matrix. This structure enables us to fill a key gap in current infinite factor models, which do not accommodate grouped variables and other nonexchangeable structures. We also propose extending this class to the more general class of matrix decomposition models. Symmetrically to the use of the exogenous information about variables, the matrix decomposition model also embeds auxiliary information about the row entities of the data matrix, enabling us to model the dependence through structured sparse latent elements with respect to both matrix dimensions. A novel estimation algorithm inspired by boosting approaches is designed, overcoming the computational limits of current Markov chain Monte Carlo approaches and the nonidentifiability issue that characterizes all overfitted factorization models. Practical gains with respect to the current state of the art are demonstrated in simulation studies and discussed in real data applications, further illustrating benefits in terms of parameter estimation and model interpretation. Football player tracking data represent the common thread of the thesis.
They motivate the introduction of the novel methodologies to address the challenges arising from the need to extract valuable knowledge from a high-dimensional dataset representing a complex phenomenon. The amount of information included is such that several aspects of interest can be explored. In this thesis, we focus on three of them: similarities among players, positional and technical predictors of the dangerousness of an action, and player run heatmaps. In all these cases, thoughtful insights and representations are provided, shedding light on the potential of our approach. However, the generality of the proposed framework is expected to impact many other application fields.
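
As a purely illustrative aid, the two ideas at the core of the abstract (loadings whose columns are shrunk increasingly with the component index, and a truncation level tied to each factor's contribution to the overall variability) can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's method: the prior is a simplified multiplicative-gamma-style construction without local scales, the truncation rule is a plain cumulative share of squared loadings, and the names `sample_loadings`, `truncation_level`, `a1`, `a2` and `threshold` are all hypothetical.

```python
import math
import random


def sample_loadings(p, H, a1=2.0, a2=3.0, seed=0):
    """Draw a p x H loadings matrix whose columns tend to shrink with the index.

    Assumption: column precisions are cumulative products of gamma draws,
    a simplified stand-in for an increasing shrinkage prior.
    """
    rng = random.Random(seed)
    loadings = [[0.0] * H for _ in range(p)]
    tau = 1.0  # running column precision
    for h in range(H):
        # The first factor uses shape a1, later factors use shape a2 (scale 1).
        tau *= rng.gammavariate(a1 if h == 0 else a2, 1.0)
        sd = 1.0 / math.sqrt(tau)
        for j in range(p):
            loadings[j][h] = rng.gauss(0.0, sd)
    return loadings


def column_contributions(loadings):
    """Squared norm of each column: its share of the low-rank variability."""
    H = len(loadings[0])
    return [sum(row[h] ** 2 for row in loadings) for h in range(H)]


def truncation_level(loadings, threshold=0.95):
    """Smallest number of leading columns reaching the given variability share.

    Hypothetical rule for illustration only.
    """
    contrib = column_contributions(loadings)
    total = sum(contrib)
    running = 0.0
    for h, c in enumerate(contrib, start=1):
        running += c
        if running >= threshold * total:
            return h
    return len(contrib)


if __name__ == "__main__":
    lam = sample_loadings(p=20, H=10)
    print("column contributions:", [round(c, 3) for c in column_contributions(lam)])
    print("columns kept at 95%:", truncation_level(lam, 0.95))
```

In this sketch, the later columns are generally drawn with smaller variance, so a threshold on the cumulative share usually discards trailing columns. The actual criterion and prior developed in the thesis may differ substantially.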