|Customer Name||#######||Customer Institution||#######||Customer Email||#######||Project ID||#######|
Your reliable partner in genomics, transcriptomics and proteomics
Whole Exome Sequencing (WES) is an efficient strategy to selectively sequence the coding regions (exons) of a genome, typically human, to discover rare or common variants associated with a disorder or phenotype. By focusing sequence production on exons, which represent ~2.5% of the human genome, many more individuals can be examined at significantly reduced cost and time compared to sequencing their entire genomes. The most common methods rely on hybridization with oligonucleotide probes to 'capture' targeted DNA fragments, thereby enriching for exonic sequences. Targeted exonic sequences include well-established annotated coding and non-coding exons. Regions that are not in close proximity to the targeted regions (on the order of 125 bases) are not sequenced. Therefore, variants within introns, promoters or intergenic regions are generally not detected.
The goal of this approach is to identify genetic variants that alter protein sequences, and to do this at a much lower cost than whole-genome sequencing. Since these variants can be responsible for both Mendelian and common polygenic diseases, such as Alzheimer's disease, whole exome sequencing has been applied both in academic research and as a clinical diagnostic.
Total DNA was isolated from whole blood collected in EDTA tubes or from tissue samples using a standard DNA extraction protocol. DNA was assessed by reading A260/280 ratios on a spectrophotometer; samples with A260/280 ratios between 1.8 and 2.0 were considered acceptable. DNA samples were then fragmented by sonication and subjected to library construction. Exome capture was performed using the SureSelect Human All Exon V6 Kit (Agilent Technologies) following the vendor's recommended protocol, and sequencing was performed on the Illumina HiSeq X Ten at LC Sciences as a 150-bp paired-end run.
Workflow of exome capture
Sequence and primary analysis
Sequencing generated a total of (____) million paired-end reads of 150 bp length. This yielded (_____)G of sequence, representing approximately (______) times the size of the human exome target (50 Mb). Prior to alignment, low-quality data (1, reads containing sequencing adaptors; 2, nucleotides with a quality score lower than 20) were removed. After filtering, a total of (_____)G bp of cleaned, paired-end reads remained. The raw sequence data have been submitted to the NCBI Sequence Read Archive with accession number (_____).
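The relationship between read count, sequence yield and fold coverage described above can be sketched as simple arithmetic. This is a minimal illustration, not the pipeline's own code: the function names are hypothetical, the read count is assumed to be a count of read pairs, and the example figure of 40 million pairs is invented purely to show the calculation. The 150 bp read length and 50 Mb target size are taken from the text.

```python
EXOME_SIZE_BP = 50_000_000  # 50 Mb target size, as stated in the report


def yield_in_bases(read_pairs: int, read_length: int = 150) -> int:
    """Total sequenced bases, assuming each pair contributes two reads."""
    return read_pairs * 2 * read_length


def fold_coverage(total_bases: float, target_size_bp: float = EXOME_SIZE_BP) -> float:
    """Mean fold coverage = sequenced bases / size of the target region."""
    if target_size_bp <= 0:
        raise ValueError("target size must be positive")
    return total_bases / target_size_bp


# Hypothetical input: 40 million read pairs (not a value from this report).
example_yield = yield_in_bases(40_000_000)
example_depth = fold_coverage(example_yield)
print(f"{example_yield / 1e9:.1f} G bases, about {example_depth:.0f}x the exome size")
```

This figure is an upper bound on usable depth, since adaptor and quality filtering, duplicates and off-target reads all reduce the bases that actually cover the exome.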
Alignment and duplicate marking
For the alignment step, BWA is used to align the reads contained in paired FASTQ files to the reference genome. As the first post-alignment processing step, Picard tools is used to identify and mark duplicate reads in the BAM file.
Local realignment around INDELs
In the second post-alignment processing step, local read realignment is performed to correct for potential alignment errors around indels. Mapping of reads around the edges of indels often results in misaligned bases creating false positive SNP calls. Local realignment uses these mismatching bases to determine if a site should be realigned, and applies a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel and remove misalignment artifacts.
Base quality score recalibration
Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. Because of systematic biases, the reported quality scores are known to be inaccurate and must therefore be recalibrated prior to genotyping. After recalibration, the quality scores in the output BAM correspond more closely to the probability of a sequencing error.
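The link between a quality score and an error probability can be made concrete with the standard Phred convention, Q = -10 * log10(p). The sketch below only illustrates that conversion; it is not the recalibration procedure itself, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of a base-call error."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# A score of 20 corresponds to a 1-in-100 chance that the base call is wrong.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Under this convention, recalibration amounts to adjusting the reported Q so that the implied error probability matches the empirically observed mismatch rate.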
Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for variation from the reference via Bayesian inference.
A Gaussian mixture model is fit to assign an accurate confidence score to each putative mutation call and to evaluate new potential variants.
Variant function annotation
Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is utilized to add biological information to a set of variants.
Bioinformatics pipeline for whole exome sequencing
Species name: Human
Latin name: Homo sapiens
Specimens: tissue/whole blood
Disease name: Infantile autism
Disease type: Complex disease
document location: summary/1_RawData/sample_info_mendelian.xlsx
document location: summary/1_SequencingData_Overview/ReadsQC.xlsx
document location: summary/2_MappedData/ReadsDepthCoverage.png
Depth of coverage on each chromosome:
Depth of coverage = covered total length / total length of all exons on each chromosome
document location: summary/2_MappedData/DepthCoverageByChr.png
document location: summary/2_MappedData/MappedStatistics.xlsx
Depth of coverage on each sample:
document location: summary/2_MappedData/DepthCoverageByTarget.png
A single-nucleotide polymorphism, often abbreviated to SNP, is a variation in a single nucleotide that occurs at a specific position in the genome including transition and transversion, where each variation is present to some appreciable degree within a population (e.g. > 1%).
For example, at a specific base position in the human genome, the base C may appear in most individuals, but in a minority of individuals, the position is occupied by base A. There is a SNP at this specific base position, and the two possible nucleotide variations – C or A – are said to be alleles for this base position.
SNPs underlie differences in our susceptibility to disease; a wide range of human diseases, e.g. sickle-cell anemia, β-thalassemia and cystic fibrosis result from SNPs. The severity of illness and the way our body responds to treatments are also manifestations of genetic variations. For example, a single base mutation in the APOE (apolipoprotein E) gene is associated with a higher risk for Alzheimer's disease.
A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration.
Indel is a molecular biology term for an insertion or deletion of bases in the genome of an organism. It is classified among small genetic variations, measuring from 1 to 10 000 base pairs in length, including insertion and deletion events that may be separated by many years, and may not be related to each other in any way. A microindel is defined as an Indel that results in a net change of 1 to 50 nucleotides.
In coding regions of the genome, unless the length of an Indel is a multiple of 3, it will produce a frameshift mutation. For example, a common microindel which results in a frameshift causes Bloom syndrome in the Jewish or Japanese population. Indels can be contrasted with a point mutation. An Indel inserts or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels can also be contrasted with Tandem Base Mutations (TBM), which may result from fundamentally different mechanisms. A TBM is defined as a substitution at adjacent nucleotides (primarily substitutions at two adjacent nucleotides, but substitutions at three adjacent nucleotides have been observed).
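The multiple-of-3 rule above can be sketched as a small check on the net length change of an Indel. This is a minimal illustration that assumes alleles are given as VCF-style REF and ALT strings and that the variant lies entirely within a coding region; the function names are hypothetical.

```python
def indel_net_length(ref: str, alt: str) -> int:
    """Net change in sequence length: positive for insertions, negative for deletions."""
    return len(alt) - len(ref)


def is_frameshift(ref: str, alt: str) -> bool:
    """True if a coding Indel changes the length by a non-multiple of 3."""
    net = indel_net_length(ref, alt)
    return net != 0 and net % 3 != 0


print(is_frameshift("A", "AT"))    # 1-base insertion
print(is_frameshift("ATGC", "A"))  # 3-base deletion, reading frame preserved
print(is_frameshift("A", "G"))     # substitution, no length change
```

A substitution has a net length change of zero, so it is never reported as a frameshift here, which matches the contrast with point mutations drawn in the text.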
Indels, being either insertions, or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies. It has been shown that genomic regions with multiple Indels can also be used for species-identification procedures.
An Indel change of a single base pair in the coding part of an mRNA results in a frameshift during mRNA translation that could lead to an inappropriate (premature) stop codon in a different frame. Indels that are not multiples of 3 are particularly uncommon in coding regions but relatively common in non-coding regions. There are approximately 192-280 frameshifting Indels in each person. Indels are likely to represent between 16% and 25% of all sequence polymorphisms in humans. In fact, in most known genomes, including humans, Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
The term "Indel" has been co-opted in recent years by genome scientists for use in the sense described above. This is a change from its original use and meaning, which arose from systematics. In systematics, researchers could find differences between sequences, such as from two different species, but it was impossible to infer whether one species lost the sequence or the other species gained it. For example, species A has a run of 4 G nucleotides at a locus and species B has 5 G's at the same locus. If the mode of selection is unknown, one cannot tell if species A lost one G (a "deletion" event) or species B gained one G (an "insertion" event). When one cannot infer the phylogenetic direction of the sequence change, the sequence change event is referred to as an "Indel".
document location: summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx
document location: summary/3_VariantData/*/*.SNV.png
Tips: Ts means transition and Tv means transversion.
Statistics of variant types in SNP:
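A transition (Ts) exchanges a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); a transversion (Tv) exchanges a purine for a pyrimidine or the reverse. The sketch below classifies single-base substitutions on that basis and computes a Ts/Tv ratio. The function names and the example substitutions are hypothetical and are not taken from the report's data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'Ts' for a transition or 'Tv' for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class on both sides means transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "Ts" if same_class else "Tv"


def ts_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    counts = {"Ts": 0, "Tv": 0}
    for ref, alt in substitutions:
        counts[classify_substitution(ref, alt)] += 1
    if counts["Tv"] == 0:
        raise ValueError("no transversions; ratio is undefined")
    return counts["Ts"] / counts["Tv"]


# Hypothetical substitutions, for illustration only.
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("C", "A")]))
```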
document location: summary/3_VariantData/VariantsType_SNP.xlsx
document location: summary/3_VariantData/*/*.SNP_VariantsType.png
All SNPs were annotated by SnpEff in VCF format.
document location: summary/3_VariantData/*/*.snp.annotation.fixed.function.vcf
VCF format Description:
document location: summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx
document location: summary/3_VariantData/*/*.INDEL.png
Statistics of variant types in Indel:
document location: summary/3_VariantData/VariantsType_INDEL.xlsx
document location: summary/3_VariantData/*/*.INDEL_VariantsType.png
All Indels were annotated by SnpEff in VCF format.
document location: summary/3_VariantData/*/*.indel.annotation.fixed.function.vcf
VCF format Description:
The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.
As of build 131 (available February 2010), dbSNP had amassed over 184 million submissions representing more than 64 million distinct variants for 55 organisms, including Homo sapiens, Mus musculus, Oryza sativa, and many other species. A full list of organisms and the number of submissions for each can be found at: https://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi
document location: summary/4_VariantMultiAnno/*/SNP/*.snp.dbSNP.xlsx
The 1000 Genomes Project (abbreviated as 1KGP), launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using newly developed technologies which were faster and less expensive. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature (PMID: 20981092). In 2012, the sequencing of 1092 genomes was announced in a Nature publication (PMID: 23128226). In 2015, a paper published in Nature (PMID: 26432245) reported results and the completion of the project and opportunities for future research. Many rare variations, restricted to closely related groups, were identified, and eight structural-variation classes were analyzed.
The project united multidisciplinary research teams from institutes around the world, including China, Italy, Japan, Kenya, Nigeria, Peru, the United Kingdom, and the United States. Each contributed to the enormous sequence dataset and to a refined human genome map, which is freely accessible through public databases to the scientific community and the general public alike.
By providing an overview of human genetic variation, the consortium generated a valuable tool for all fields of biological science, especially in the disciplines of genetics, medicine, pharmacology, biochemistry, and bioinformatics.
High-frequency mutations (MAF > 5%) were filtered out. We retained low-frequency mutations (0.5% <= MAF <= 5%) as candidates for downstream analysis.
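The frequency filter above can be sketched as a range check on the minor allele frequency (MAF). This is a minimal illustration using the thresholds stated in the text; the function name, the record layout and the example variants are hypothetical. How variants with no recorded frequency are handled is not stated in the report, so that case is left out.

```python
def is_low_frequency_candidate(maf: float, lower: float = 0.005, upper: float = 0.05) -> bool:
    """Keep variants with 0.5% <= MAF <= 5%, the range stated in the report."""
    return lower <= maf <= upper


# Hypothetical variant records, for illustration only.
variants = [
    {"id": "var1", "maf": 0.20},   # high frequency: removed
    {"id": "var2", "maf": 0.03},   # low frequency: retained
    {"id": "var3", "maf": 0.001},  # below 0.5%: outside the retained range
]

candidates = [v["id"] for v in variants if is_low_frequency_candidate(v["maf"])]
print(candidates)
```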
document location: summary/4_VariantMultiAnno/*/SNP/*.snp.dbSNP.xlsx
Mutations located in a CDS, exon or splicing region (±10 bp) are chosen as candidates for downstream analysis.
document location: summary/4_VariantMultiAnno/*/SNP/*.snp.dbSNP.KGenome.func.xlsx
The SIFT (sorting intolerant from tolerant) algorithm helps bridge the gap between mutations and phenotypic variations by predicting whether an amino acid substitution is deleterious, that is, whether it affects protein function. SIFT has been used in disease, mutation and genetic studies, and a protocol for its use has been published in Nature Protocols (PMID: 26633127). SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, collected through PSI-BLAST. SIFT can be applied to naturally occurring nonsynonymous polymorphisms or laboratory-induced missense mutations. The score ranges from 0 to 1. Mutations with a score > 0.05 are predicted to be tolerated and to have a minor impact on protein function. Mutations with a score < 0.05 are predicted to be deleterious and to have a large impact on protein function.
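The score interpretation above can be sketched as a simple threshold rule. This is only an illustration of how a precomputed SIFT score might be read, not SIFT itself; the function name is hypothetical. The text does not say how a score of exactly 0.05 is treated, so labelling it as tolerated here is an assumption.

```python
def classify_sift(score: float, threshold: float = 0.05) -> str:
    """Label a SIFT score as 'deleterious' (below threshold) or 'tolerated'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    # Assumption: a score exactly equal to the threshold counts as tolerated.
    return "deleterious" if score < threshold else "tolerated"


for s in (0.0, 0.01, 0.05, 0.6):
    print(s, classify_sift(s))
```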
document location: summary/4_VariantMultiAnno/*/SNP/*.snp.dbSNP.KGenome.func.syn.xlsx
De novo mutations have long been known to cause genetic disease, but their true contribution to the disease burden can only now be determined using family-based whole-genome or whole-exome sequencing (WES) approaches. De novo mutations play a prominent part in rare and common forms of diseases, including intellectual disability, autism and schizophrenia. De novo mutations provide a mechanism by which early-onset reproductively lethal diseases remain frequent in the population. These mutations, although individually rare, may capture a significant part of the heritability for complex genetic diseases that is not detectable by genome-wide association studies (PMID: 22805709).
To test whether these mutations are associated with disease, we compare the observed de novo mutation rate of each gene with an expected rate derived from the known mutation rate of the gene, the gene length and the sex ratio of the patients.
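One common way to compare an observed de novo count with an expected one is a one-sided Poisson test, sketched below. The report does not name the statistical test it uses, so this is an assumed stand-in rather than the pipeline's method. The function names are hypothetical, the per-gene rate is assumed to already reflect gene length, the gene is assumed autosomal (two transmitted chromosomes per patient), and the sex-ratio adjustment mentioned in the text is not modeled.

```python
import math


def expected_de_novo_count(per_gene_rate: float, n_patients: int) -> float:
    """Expected de novo mutations in a gene across patients (autosomal assumption)."""
    return 2 * per_gene_rate * n_patients


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed < 0 or expected < 0:
        raise ValueError("observed and expected must be non-negative")
    # Sum P(X = i) for i below the observed count, then take the complement.
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(observed)
    )
    return max(0.0, 1.0 - below)


# Hypothetical inputs: per-gene rate 1e-5 per chromosome, 1000 patients, 3 observed.
lam = expected_de_novo_count(1e-5, 1000)
print(lam, poisson_upper_tail(3, lam))
```

A small tail probability indicates more de novo mutations in the gene than the background rate would predict; in practice such p-values would also need correction for the number of genes tested.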
document location: summary/5_ComplexDisease/*.snp.De_novo.tformat.GeneInfo.Filter.GO.KEGG.Gene.xlsx
Number of de novo variants:
document location: summary/5_ComplexDisease/GeneDenovoNum.xlsx
In genetics, the mutation rate is the frequency of de novo mutations in a single gene or organism over time. Mutation rates are not constant and are not limited to a single type of mutation; they are therefore given for specific classes of mutations. Point mutations are a class of mutations that change a single base. Missense and nonsense mutations are two subtypes of point mutations. The rate of these types of substitutions can be further subdivided into a mutation spectrum, which describes the influence of the genetic context on the mutation rate.
There are several natural units of time for each of these rates, with rates being characterized either as mutations per base pair per cell division, per gene per generation, or per genome per generation. The mutation rate of an organism is an evolved characteristic and is strongly influenced by the genetics of each organism, in addition to strong influence from the environment. The upper and lower limits to which mutation rates can evolve are the subject of ongoing investigation. However, the mutation rate does vary across the genome, and rates differ for DNA, for RNA and for individual genes.
When the mutation rate in humans increases, certain health risks can occur, for example, cancer and other hereditary diseases. Knowledge of mutation rates is vital to understanding the future of cancers and many hereditary diseases.
document location: summary/5_ComplexDisease/geneMutRate_p_Filtered.xlsx
2575 West Bellfort Street
Local (713) 664-7087
Toll Free: 1-888-528-8818
Fax: (713) 664-8181