Although next-generation sequencing and microarray services become more affordable every year, these projects still represent a significant investment of time and budget. Researchers resort to several strategies to conserve their limited budgets and time, but some of these strategies have been shown to produce poor data and are therefore counterproductive. An investigator can be left with unusable data after investing substantial resources.
One such strategy, which can often undermine the success of a profiling experiment, is the pooling of biological replicates. Some researchers may be under the impression that pooling biological replicates, instead of treating each replicate as an individual sample, is a statistically acceptable strategy for discovering differentially expressed genes (DEGs). A recent study by researchers at Aarhus University demonstrated that this is not an effective strategy for minimizing costs. They evaluated the validity of two pooling strategies (3 or 8 biological replicates/pool; two pools/group) by comparing next-gen sequencing data from RNA pools with the data obtained from the corresponding individual RNA samples (3 or 8 samples/group). Agreement between the analyses of the RNA pools and of the corresponding individual samples was weak, and the differential expression of most genes detected in the RNA-pool data was not corroborated by the analyses of the corresponding individual samples. Despite having good sensitivity and specificity, both pooling strategies displayed poor positive predictive values, which undermined their ability to predict true-positive DEGs.
Figure: Agreement between sequencing of RNA pools and sequencing of the corresponding individual RNA samples.
In effect, “pooled samples did not represent the population variations in gene expression levels”, which led “to erroneously long DEG lists with low positive predictive values”. From this evidence, it’s easy to see how incorporating an appropriate number of unpooled biological replicates in one’s initial study is ultimately more cost-effective in the long run, as it prevents investing time and money in attempts to validate false results.
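To see how a pooling strategy can combine good sensitivity and specificity with a poor positive predictive value (PPV), consider the arithmetic below. It is a minimal sketch: the sensitivity, specificity, and fraction of truly differentially expressed genes are hypothetical values chosen for illustration and are not figures from the Aarhus study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of genes called differentially expressed that truly are.

    prevalence is the (assumed) fraction of tested genes that are truly
    differentially expressed.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs: 90% sensitivity, 95% specificity,
# and only 1% of tested genes truly differentially expressed.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.95, prevalence=0.01)
print(f"PPV = {ppv:.3f}")
```

When only a small fraction of genes truly differ, even a modest false-positive rate spread across thousands of genes can outnumber the true hits, which is how a long DEG list can consist mostly of false positives.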
Ideally, a researcher would always incorporate a large number of biological replicates in their experimental design, but in reality, this is not always feasible. Due to time and budget constraints, an investigator may need to perform a pilot project that strikes a compromise between an ideal number of replicates and cost-effectiveness. In that event, how can a researcher decide on a sample size that’s going to maximize the effectiveness of their genomic data?
Power and sample size analyses are important tools for generating experimental designs. These analyses can guide researchers to the appropriate number of replicates for detecting differences between two or more groups, and so maximize their opportunity for true-positive discovery.
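As a rough illustration of what a sample size calculation involves, the sketch below uses the standard normal approximation for comparing two group means on a log scale. It treats a single gene in isolation and ignores multiple-testing correction, so it is far simpler than the dedicated methods discussed in this article; the function name and the input values are assumptions made for the example.

```python
from math import ceil, log
from statistics import NormalDist


def samples_per_group(fold_change, sd_log, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-group comparison.

    fold_change: smallest fold change worth detecting (assumed).
    sd_log: between-replicate standard deviation of log expression (assumed).
    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / effect^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance threshold
    z_power = z.inv_cdf(power)              # quantile for the desired power
    effect = log(fold_change)               # effect size on the natural-log scale
    return ceil(2.0 * (z_alpha + z_power) ** 2 * sd_log ** 2 / effect ** 2)


# Hypothetical inputs: smaller effects need many more replicates.
for fc in (2.0, 1.5):
    print(fc, samples_per_group(fold_change=fc, sd_log=0.5))
```

Under these assumptions, the required number of replicates grows quickly as the target fold change shrinks or as biological variability increases.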
A study by researchers at Leiden University Medical Center explored new methods for power and sample size analysis to assist with experimental design. Previous power and sample size methods were limited to two-group comparisons; the Leiden group developed a flexible method that can be used for power studies in a wide variety of statistical models used for hypothesis testing in high-dimensional data studies. Their method, SSPA, is available at: http://www.bioconductor.org/packages/release/bioc/html/SSPA.html
It is possible to increase the statistical power of an RNA-Seq experiment by increasing the sample size or the sequencing depth.
Researchers at the University of Hawaii Cancer Center performed a comprehensive comparison of five differential expression analysis packages (DESeq, edgeR, DESeq2, sSeq, and EBSeq) and demonstrated that increasing the sample size raises power more than increasing the sequencing depth does, especially once the sequencing depth reaches 20 million reads. They also found that paired-sample RNA-Seq significantly enhances statistical power, confirming the importance of considering a multifactor experimental design.
The researchers provide a power analysis tool that captures the dispersion in the data and can serve as a practical reference under the budget constraint of RNA-Seq experiments: http://www2.hawaii.edu/~lgarmire/RNASeqPowerCalculator.htm
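One intuition for why sample size tends to matter more than depth can be sketched with a simple approximation in which the variance of a gene's log count is the sum of a sampling term (1 / mean count, which shrinks with depth) and a biological term (the squared coefficient of variation, which does not). This is not the method used by the Hawaii group or by their calculator; the function name and all input values are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n_per_group, mean_count, bio_cv, fold_change, alpha=0.05):
    """Approximate per-gene power for a two-group comparison.

    mean_count: expected reads for the gene, proportional to depth (assumed).
    bio_cv: biological coefficient of variation between replicates (assumed).
    """
    # Sampling noise falls with depth; biological noise is a fixed floor.
    var_log = 1.0 / mean_count + bio_cv ** 2
    std_err = sqrt(2.0 * var_log / n_per_group)
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    # Normal approximation; the negligible opposite tail is ignored.
    return z.cdf(abs(log(fold_change)) / std_err - z_alpha)


# Hypothetical gene: 20 reads on average, biological CV of 0.4, 2-fold change.
baseline = approx_power(n_per_group=3, mean_count=20, bio_cv=0.4, fold_change=2.0)
more_depth = approx_power(n_per_group=3, mean_count=40, bio_cv=0.4, fold_change=2.0)
more_samples = approx_power(n_per_group=6, mean_count=20, bio_cv=0.4, fold_change=2.0)
print(f"baseline={baseline:.2f} depth x2={more_depth:.2f} samples x2={more_samples:.2f}")
```

In this toy model, doubling the depth only trims the sampling term, whereas doubling the number of replicates divides the whole variance by two, so the gain from extra samples is larger.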
By following the recommendations put forth in these studies, investigators can improve their chances of a successful profiling project, which will ultimately conserve their resources in the long run. A locally optimal power is achievable for a given budget constraint, and the dominant contributing factor is sample size rather than sequencing depth.
- Rajkumar AP, Qvist P, Lazarus R, Lescai F, Ju J, Nyegaard M, Mors O, Børglum AD, Li Q, Christensen JH. (2015) Experimental validation of methods for differential gene expression analysis and sample pooling in RNA-seq. BMC Genomics 16:548.
- van Iterson M, van de Wiel MA, Boer JM, de Menezes RX. (2013) General power and sample size calculations for high-dimensional genomic data. Stat Appl Genet Mol Biol 12(4):449-67.
- Ching T, Huang S, Garmire LX. (2014) Power analysis and sample size estimation for RNA-Seq differential expression. RNA 20(11):1684-96.