University Heidelberg | SIP RG | Theses

Jan JOHANNES Jul 15, 2022 Theses

Thesis:: Master in Mathematics

Author:: Daniel Fridljand

Title:: Better multiple Testing: Using multivariate co-data for hypothesis weighting

Supervisors:: Wolfgang Huber (EMBL); Jan JOHANNES

Abstract:: Consider a multiple testing task in which, for each test, we have access to its p-value and to additional information represented by a univariate or multivariate covariate. The covariates may carry information on the prior probabilities of the null and alternative hypotheses and/or on the test’s power. Like several recent proposals, the independent hypothesis weighting (IHW; Ignatiadis and Huber, 2021) framework capitalizes on these covariates in the multiple testing procedure. IHW partitions the covariate space into a finite number of bins and learns a weight for each bin, so that hypotheses are prioritized a priori based on their covariate. IHW guarantees false discovery rate (FDR) control while increasing the proportion of correct discoveries (power) compared to unweighted methods such as the Benjamini-Hochberg procedure (BH). Ignatiadis and Huber (2021) used per-covariate quantiles for the partition. This has two limitations: the number of quantile combinations explodes with multiple covariates, and quantile-based bins are inappropriate for heterogeneous covariates. Here we propose a random forest-based approach (IHW-Forest), in which the leaves of the trees form a partition of the covariate space. The objective function is chosen such that the splits are sensitive to the prior probability of a hypothesis being true. IHW-Forest scales well to high-dimensional covariates and can detect small regions with signal. It can deal with heterogeneous covariates and ignores uninformative ones. The latter is useful in practice when the user does not know which covariates are relevant for the hypotheses under study. This extends the applicability of IHW by automatically selecting the most relevant covariate. Lastly, IHW-Forest takes advantage of the p-values to construct the partition, yielding homogeneous bins and hence increasing power. We demonstrate IHW-Forest’s benefits in simulations and in a bioinformatics application.
IHW’s power vanishes as the number of covariate dimensions increases, while IHW-Forest’s power remains stable and well above that of BH and IHW. When the signal is concentrated in a shrinking region, IHW-Forest outperforms BH, IHW and other competing methods in power. We apply IHW-Forest to an hQTL analysis, which looks for associations between genetic variation and histone modifications on the human chromosomes. This resulted in 16 billion tests on the first two chromosomes. We used 11 different covariates, among them the genomic distance. Because the number of quantile combinations grows exponentially with the number of covariates, IHW is no longer applicable here, whereas IHW-Forest is. The updated package will be available from Bioconductor at http://www.bioconductor.org/packages/IHW.
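To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the IHW or IHW-Forest implementation and does not use the Bioconductor package's API. The function names `grid_bin_count` and `weighted_bh`, the toy p-values and the bin weights are all hypothetical. The sketch assumes the per-bin weights are already given, are non-negative and average to 1 over all hypotheses. It skips the step where IHW learns the weights from the data, and it does not build a forest-based partition. The assumption of five quantiles per covariate in the usage note is also purely illustrative.

```python
import math
from typing import Dict, List, Sequence


def grid_bin_count(quantiles_per_covariate: int, n_covariates: int) -> int:
    """Number of cells when every covariate is cut into the same number of
    quantile bins and the bins are combined into a grid."""
    return quantiles_per_covariate ** n_covariates


def weighted_bh(
    p_values: Sequence[float],
    bins: Sequence[int],
    bin_weights: Dict[int, float],
    alpha: float = 0.1,
) -> List[bool]:
    """Weighted Benjamini-Hochberg with one fixed weight per bin.

    Assumption: the weights are supplied by the caller (not learned here),
    are non-negative, and sum to the number of hypotheses.
    """
    m = len(p_values)
    if m == 0:
        return []
    weights = [bin_weights[b] for b in bins]
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), m, rel_tol=1e-9):
        raise ValueError("weights must average to 1 over all hypotheses")

    # Weighted p-values: a weight above 1 makes a hypothesis easier to reject,
    # a weight of 0 means it can never be rejected.
    q = [p / w if w > 0 else math.inf for p, w in zip(p_values, weights)]

    # Standard BH step-up rule applied to the weighted p-values.
    order = sorted(range(m), key=lambda i: q[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k_max = rank

    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Toy input: bin 0 is assumed to be enriched for signal and gets more weight.
toy_p = [0.001, 0.04, 0.06, 0.30, 0.50, 0.60]
toy_bins = [0, 0, 0, 1, 1, 1]
print(weighted_bh(toy_p, toy_bins, {0: 1.5, 1: 0.5}, alpha=0.1))
print(weighted_bh(toy_p, toy_bins, {0: 1.0, 1: 1.0}, alpha=0.1))  # plain BH
print(grid_bin_count(5, 11))
```

On this toy input the weighted call rejects the three hypotheses in bin 0, while plain BH (all weights equal to 1) rejects only the first. The last line shows why a quantile grid stops being practical: with 11 covariates and an assumed five quantiles each, the grid already has 5^11 = 48,828,125 cells.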

Reference:: N. Ignatiadis and W. Huber. Covariate powered cross-weighted multiple testing, Journal of the Royal Statistical Society, Series B, 83(4):720–751, 2021.