

Survival analysis involves predicting a time-to-event, such as the survival time of a patient, from a set of features of the subject, as well as identifying the features most relevant to the time-to-event. The Cox proportional hazards model (Cox, 1972) provides a flexible mathematical framework to describe the relationship between the survival time and the features, allowing a time-dependent baseline hazard. Survival analysis faces computational and statistical challenges when the predictors are ultrahigh-dimensional (the feature dimension exceeds the number of observations) and large scale (the data matrix does not fit in memory). Based on the Batch Screening Iterative Lasso (BASIL), we develop an algorithm that fits a Cox proportional hazards model by maximizing the Lasso-penalized partial likelihood. We apply the method to 306 time-to-event disease outcomes from UK Biobank combined with genetic data, and obtain improved predictive models with sparse solutions, with the number of selected variables ranging from a single active variable for some outcomes to almost 2000 for others. We note that our algorithm can be easily adapted to other applications with arbitrarily large datasets, provided that the Lasso solution is sufficiently sparse.
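For concreteness, the quantity being maximized can be written in a standard form; the notation below (design matrix rows x_i, event indicators delta_i, risk sets R_i, and regularization parameter lambda) is ours and assumes right-censored data with no tied event times:

\hat{\beta}(\lambda) \;=\; \arg\max_{\beta} \;\; \sum_{i:\,\delta_i = 1} \Bigl[ x_i^\top \beta \;-\; \log \sum_{j \in R_i} \exp\bigl(x_j^\top \beta\bigr) \Bigr] \;-\; \lambda \lVert \beta \rVert_1,
\qquad R_i = \{\, j : t_j \ge t_i \,\}.

Some implementations scale the log partial likelihood term by the sample size, which only rescales the lambda path.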

1.2. Computational challenges in large-scale and high-dimensional survival analysis

In today's applications, it is common to have datasets with millions of observations and variables. For example, the UK Biobank dataset (Sudlow and others, 2015) contains millions of genetic variants for over 500 000 individuals. Loading this data matrix into R takes on the order of terabytes of memory, which exceeds the RAM of a typical machine. While memory-mapping techniques allow users to perform computation on data outside of RAM (out-of-core computation) relatively easily (Kane and others, 2013), popular optimization algorithms require repeatedly computing matrix-vector multiplications involving the entire data matrix, resulting in slow overall speed.

The Lasso (Tibshirani, 1996) is an effective tool for high-dimensional variable selection and prediction. R packages such as glmnet (Friedman and others, 2010), penalized (Goeman, 2010), coxpath (Park and Hastie, 2007), and glcoxph (Sohn and others, 2009) solve the Lasso Cox regression problem using various strategies. However, all of these packages require loading the entire data matrix into memory, which is infeasible for biobank-scale data. To the best of the authors' knowledge, our method is the first to solve regularized Cox regression with larger-than-memory data. On the other hand, most optimization strategies used in these packages can also be incorporated into the fitting step of our algorithm; in particular, snpnet-Cox uses the cyclical coordinate descent implemented in glmnet. Even if these packages did support out-of-core computation, using them directly would be computationally inefficient. To be concrete, in one of our simulation studies on the UK Biobank data, the training data take about 2 TB. With the highly optimized out-of-core matrix-vector multiplication that PLINK2 provides, we are able to run a single such operation in about 2-3 min. Without variable screening, cyclic coordinate descent (or proximal gradient descent) would require from a few to tens or even hundreds of such matrix-vector multiplications for a single value of the regularization parameter lambda. Our algorithm exploits the sparsity structure of the problem to reduce the frequency of this operation to mostly once or twice across several values of lambda: most of these expensive, out-of-core matrix-vector multiplications are replaced with fast, in-memory ones that work on much smaller subsets of the data.
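To make the screening idea concrete, the following R sketch shows one BASIL-style pass on simulated data. It is an illustration, not the snpnet-Cox implementation: the helper crossprod_cols() stands in for PLINK2's out-of-core matrix-vector multiplication (in the real setting it streams the genotype file from disk, and only the screened columns are ever loaded into memory), the batch size of 200 is arbitrary, and the null-model martingale residuals are used as the initial screening scores.

# One BASIL-style screening pass (simulated data, in memory for illustration)
library(survival)
library(glmnet)

set.seed(1)
n <- 500; p <- 2000
X <- matrix(rnorm(n * p), n, p)          # stands in for the file-backed genotype matrix
time <- rexp(n, rate = exp(0.5 * X[, 1] - 0.5 * X[, 2]))
status <- rbinom(n, 1, 0.7)              # 1 = event, 0 = censored
y <- cbind(time = time, status = status)

# One "expensive" pass over all columns; in practice this is the out-of-core multiplication
crossprod_cols <- function(X, v) as.numeric(crossprod(X, v))

# Step 1: score every variable against the null model using its martingale residuals
r0 <- residuals(coxph(Surv(time, status) ~ 1), type = "martingale")
score <- abs(crossprod_cols(X, r0))

# Step 2: keep only the top-scoring variables (the "batch") in memory
batch <- order(score, decreasing = TRUE)[1:200]

# Step 3: fit the Lasso Cox path on the small submatrix with glmnet's
# cyclical coordinate descent; a full BASIL loop would then check the KKT
# conditions for the excluded variables and enlarge the batch if any are violated
fit <- glmnet(X[, batch, drop = FALSE], y, family = "cox")
sum(as.matrix(coef(fit, s = fit$lambda[10])) != 0)   # number of selected variables at one lambda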

Time-to-event data for survival analysis include the age of disease onset and the progression from disease diagnosis to another, more severe outcome such as surgery or death. As population-scale cohorts such as UK Biobank, the Million Veteran Program, and FinnGen aggregate time-to-event data for survival analysis, it is increasingly important to consider the computational cost of statistics like the C-index that are used to build and evaluate predictive models. Several frequently used C-index computation algorithms, including the first algorithm we tried, have O(n^2) time complexity. Here, we present an implementation with O(n log n) time complexity (and O(n) space complexity) that can yield more than a 10 000-fold speedup for biobank-scale data relative to several R packages, and more than a 10-fold speedup compared to the existing O(n log n) algorithm implemented in the survival analysis package by Therneau and Lumley (2014). We first assume that there are no tied predictions or events.
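For reference, when there are no tied predictions or event times the statistic in question is Harrell's concordance index. In the notation below (our addition), t_i are the observed times, delta_i the event indicators, and eta_i the predicted risk scores, with the convention that a higher score predicts an earlier event:

C \;=\; \frac{\sum_{i \neq j} \mathbf{1}\{ t_i < t_j \}\, \delta_i \, \mathbf{1}\{ \eta_i > \eta_j \}}{\sum_{i \neq j} \mathbf{1}\{ t_i < t_j \}\, \delta_i }.

A naive double loop over all pairs evaluates this in O(n^2) time; sorting-based approaches reach O(n log n).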
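The R sketch below shows one way to reach the O(n log n) bound; it is our illustration, not necessarily the implementation referenced above. Subjects are processed in decreasing order of follow-up time while a Fenwick (binary indexed) tree counts the risk-score ranks seen so far; the function name cindex_nlogn and the variable names are ours, and the no-ties assumption mirrors the text.

# O(n log n) concordance index for right-censored data, assuming no tied
# times and no tied risk scores; higher risk score predicts an earlier event.
cindex_nlogn <- function(time, status, risk) {
  n <- length(time)
  rank_risk <- rank(risk)                 # with no ties, a permutation of 1..n
  ord <- order(time, decreasing = TRUE)   # longest follow-up first

  tree <- integer(n)                      # Fenwick tree over inserted risk ranks
  add <- function(i) {
    while (i <= n) { tree[i] <<- tree[i] + 1L; i <- i + bitwAnd(i, -i) }
  }
  count_leq <- function(i) {              # number of inserted ranks <= i
    s <- 0L
    while (i > 0) { s <- s + tree[i]; i <- i - bitwAnd(i, -i) }
    s
  }

  concordant <- 0; comparable <- 0; inserted <- 0
  for (k in ord) {
    if (status[k] == 1) {
      # every subject inserted so far has a strictly longer follow-up time,
      # so each forms a comparable pair with the event subject k
      comparable <- comparable + inserted
      concordant <- concordant + count_leq(rank_risk[k] - 1)  # pairs where k has the higher risk
    }
    add(rank_risk[k])
    inserted <- inserted + 1
  }
  concordant / comparable
}

# quick usage on simulated, untied data
set.seed(1)
ti <- rexp(1000); st <- rbinom(1000, 1, 0.6); eta <- rnorm(1000)
cindex_nlogn(ti, st, eta)

For small untied datasets the result can be checked against a direct double loop over all pairs; handling tied times or tied predictions, as the survival package does, would require additional bookkeeping.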
