Receiver Operating Characteristic and Recursive Linear Modeling for Gene Expression Analysis and Its Application to Alzheimer’s Disease Diagnosis by Peripheral Blood Mononuclear Cells
Author(s): Aibing Rao
Background and objectives: Microarray and RNA-Seq gene expression analyses often generate expression matrices of very high dimension. In biomarker discovery studies, the goal of data analysis is to identify a small set of important genes, so innovative and effective gene discovery algorithms are always desirable. In this study we introduce a novel gene expression modeling algorithm called Recursive Linear Modeling (RLM). By combining single-variate analysis with RLM for peripheral blood mononuclear cell (PBMC) gene expression analysis, we established and validated two prediction models for the diagnosis of Alzheimer’s disease (AD) and mild cognitive impairment (MCI).
Methods: Publicly available PBMC gene expression data sets for AD/MCI were used to develop and demonstrate the algorithm. The AD group and the MCI group were each compared to the healthy control (HC) group. First, each gene was analyzed as a single-variate predictor using ROC (receiver operating characteristic) analysis, genes were ranked by AUC (area under the curve) in decreasing order, and a heuristic candidate gene set was selected from the top of the ranking. Second, for a given model size (the number of genes in a model), the candidate set was searched by a recursive linear modeling procedure. Finally, the optimal model size and the corresponding model were determined by the maximal R-square among all sizes.
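The two stages described in the Methods can be illustrated with a minimal sketch on synthetic data. It is not the RLM procedure itself, whose details are not given in this abstract: the search step below is a plain greedy forward selection used as a stand-in, the direction-agnostic AUC (taking the larger of AUC and 1 - AUC) is an assumption, and all function names are hypothetical. Because plain R-square never decreases as nested least-squares models grow, the sketch picks the model size by adjusted R-square, which is also an assumption rather than the criterion stated above.

```python
import numpy as np

def single_gene_auc(x, y):
    """AUC of one gene used as a score, by pairwise comparison (ties count 0.5)."""
    pos, neg = x[y == 1], x[y == 0]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    # Assumption: up- and down-regulated genes are treated alike.
    return max(auc, 1.0 - auc)

def rank_genes_by_auc(X, y, top_n):
    """Return indices and AUCs of the top_n genes, in decreasing AUC order."""
    aucs = np.array([single_gene_auc(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-aucs)[:top_n]
    return order, aucs[order]

def r_squared(X, y, cols):
    """R-square of an ordinary least-squares fit of y on the chosen genes."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def greedy_forward_search(X, y, candidates, max_size):
    """Stand-in for the search step: add one gene at a time, maximizing R-square."""
    selected, remaining, history = [], list(candidates), []
    n = len(y)
    for size in range(1, max_size + 1):
        best_gene = max(remaining, key=lambda g: r_squared(X, y, selected + [g]))
        selected.append(best_gene)
        remaining.remove(best_gene)
        r2 = r_squared(X, y, selected)
        adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - size - 1)
        history.append((list(selected), r2, adj_r2))
    # Assumption: choose the size with the highest adjusted R-square.
    best = max(history, key=lambda h: h[2])
    return best, history

# Synthetic example: 60 samples, 200 genes, genes 3 and 17 carry signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[y == 1, 3] += 2.0
X[y == 1, 17] -= 2.0

candidates, aucs = rank_genes_by_auc(X, y, top_n=10)
best, history = greedy_forward_search(X, y, candidates, max_size=5)
print("candidate genes:", candidates.tolist())
print("selected model:", [int(g) for g in best[0]], "adjusted R-square:", round(best[2], 3))
```

On real data, `X` would be the samples-by-genes expression matrix and `y` the case/control labels, with an independent data set held out for validation as described in the Results.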
Results: A 30-gene AD prediction model was established and validated with high specificity and sensitivity: SS18L2, ATP6V1G1, GIMAP7, OSBPL1A, C14orf166, UQCRH, USP3, STAT6, MFSD10, HELZ, FLT3, CBX7, PEPD, FGF7, ESD, REST, TM9SF3, ZNF264, LPAR1, CTGF, EML4, BTBD10, MED31, FCGRT, TAF12, SEC11C, FCER2, FASTKD2, RPS27A, RPS27. Its model-building AUC is 0.98 and its validation AUC is 0.93. In parallel, a 23-gene MCI prediction model of similar performance was also established and validated: ULK1, UBL3, TPST2, EEF1A1, FAM21A, RAN, LCOR, NOD1, OSBPL1A, SARS, PAQR4, EGFL6, RPS23, SDHB, TFB1M, ZNF416, TRIP11, SEC22B, SELK, SDHC, SIPA1, ZSCAN21, OSGEPL1. Its model-building AUC is 0.96 and its validation AUC is 0.88. These models may support the development of accurate clinical diagnosis and early risk assessment for AD/MCI.
Conclusions: A novel feature selection and model building method that combines single-variate ROC analysis with recursive linear modeling was developed, and its application to AD/MCI prediction based on PBMC expression data showed high accuracy. The method is general and can be used to build models in other gene expression biomarker discovery studies.