Same-species Contamination Detection with Variant Calling Information from Next-generation Sequencing
Author(s): Tao Jiang and Alison A. Motsinger-Reif
Background: Same-species contamination detection is an important quality control step in genetic data analysis. Because few methods exist to detect and correct for it, same-species contamination is harder to identify than cross-species contamination. We introduce a novel machine learning algorithm that detects same-species contamination in next-generation sequencing data using a support vector machine (SVM) model.
Methods: In the first stage, a change-point detection method is used to identify copy number variations (CNVs) and copy number aberrations (CNAs) so that the affected regions can be filtered out. Next, single nucleotide polymorphism (SNP) data are used to test for same-species contamination with an SVM model. Based on the assumption that alternative allele frequencies in next-generation sequencing follow the beta-binomial distribution, the deviation parameter ρ is estimated by the maximum likelihood method. All features for the radial basis function (RBF) kernel SVM are generated from publicly available or private training data.
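To make the maximum likelihood step concrete, the following is a minimal Python sketch and not the authors' R implementation. It assumes one common beta-binomial parameterization in which ρ acts as an overdispersion parameter, with α = μ(1 − ρ)/ρ and β = (1 − μ)(1 − ρ)/ρ, a fixed mean μ = 0.5 for heterozygous sites, and a simple grid search in place of a numerical optimizer. The function names and the grid resolution are hypothetical choices made for illustration.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_loglik(rho, alt_counts, depths, mu=0.5):
    """Log-likelihood of alternative-allele read counts under a beta-binomial.

    Assumed parameterization (not taken from the paper):
    alpha = mu * (1 - rho) / rho, beta = (1 - mu) * (1 - rho) / rho.
    """
    alpha = mu * (1.0 - rho) / rho
    beta = (1.0 - mu) * (1.0 - rho) / rho
    total = 0.0
    for k, n in zip(alt_counts, depths):
        # log of the binomial coefficient C(n, k)
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        total += (log_choose
                  + log_beta(k + alpha, n - k + beta)
                  - log_beta(alpha, beta))
    return total


def estimate_rho(alt_counts, depths, mu=0.5, grid_size=999):
    """Grid-search maximum likelihood estimate of rho on the open interval (0, 1)."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda r: beta_binomial_loglik(r, alt_counts, depths, mu))


# Hypothetical heterozygous sites: alternative-allele reads and total depth.
tight_sites = estimate_rho([50] * 20, [100] * 20)
spread_sites = estimate_rho([20, 80] * 10, [100] * 20)
```

Under these assumptions, sites whose alternative allele fractions sit close to 0.5 yield a small estimate of ρ, whereas sites whose fractions scatter widely around 0.5 yield a larger one, which is the kind of signal a downstream classifier could use as a feature.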
Results: We provide an R software implementation of the approach, which we evaluated in simulation experiments based on real data. The datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generated variant call format (VCF) files from the variants identified in these data and then evaluated the power and false-positive rate. In these real data, the approach detected contamination levels as low as 5% with a reasonable false-positive rate. Sensitivity was above 99.99% and specificity was 90.24%, even in the presence of degraded samples whose features resemble those of contaminated samples.
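As an illustration of why an in silico mixture shifts allele frequencies, the sketch below computes the expected alternative allele fraction at a single diploid site when a fraction c of the reads comes from a second individual. This is a simplified model, not the authors' mixing procedure: it assumes equal copy number in both samples, unbiased read sampling, and no sequencing error, and the function name is hypothetical.

```python
def expected_alt_fraction(primary_alt_copies, contaminant_alt_copies, c):
    """Expected alternative allele fraction at a diploid site in a mixture.

    primary_alt_copies and contaminant_alt_copies are the numbers of
    alternative alleles (0, 1, or 2) carried by each individual, and c is
    the fraction of reads contributed by the contaminant (0 <= c <= 1).
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("contamination level c must lie in [0, 1]")
    return ((1.0 - c) * primary_alt_copies + c * contaminant_alt_copies) / 2.0


# Hypothetical site: primary sample homozygous alternative, contaminant
# homozygous reference, 5% contamination.
hom_alt_site = expected_alt_fraction(2, 0, 0.05)
# Hypothetical site: primary sample homozygous reference, contaminant heterozygous.
hom_ref_site = expected_alt_fraction(0, 1, 0.05)
```

Under this model, a 5% contaminant moves a homozygous site only a few percentage points away from 0 or 1, which suggests why low contamination levels are hard to separate from sequencing noise and sample degradation.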
Conclusions: Our approach uniquely detects contamination using variant calling information stored in VCF files for DNA or RNA. Importantly, it can differentiate between same-species contamination and mixtures of tumor and normal cells. Accordingly, it represents an important tool that can be applied within the quality control process.