Sequence alignment

Sequence reads were aligned to the reference genome with BWA version 0.5.9. BWA converted Illumina quality scores to Sanger quality scores during alignment. Multithreading was enabled with two concurrent threads for generating the suffix array (SA) coordinates. The alignment output, originally in SAM format, was converted to BAM using SAMtools version 0.1.14.
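The quality score conversion can be illustrated with a minimal sketch. It assumes the reads use the Illumina 1.3+ encoding (Phred score plus an ASCII offset of 64), whereas the Sanger encoding uses an offset of 33. The function name is hypothetical and this is not BWA's own code.

```python
def illumina_to_sanger(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33.

    Assumption: the input uses the Illumina 1.3+ encoding (offset 64).
    """
    converted = []
    for ch in qual:
        phred = ord(ch) - 64  # recover the numeric Phred score
        if phred < 0:
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(phred + 33))  # re-encode with the Sanger offset
    return "".join(converted)


if __name__ == "__main__":
    # 'h' encodes Phred 40 at offset 64; at offset 33 the same score is 'I'.
    print(illumina_to_sanger("hhh@"))
```

Each character shifts down by 31 code points, so the numeric Phred score itself is unchanged.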

Alignment processing

BAMs were sorted using Picard version 1.32 and binned by chromosome using SAMtools version 0.1.14. Picard was also used to remove duplicate alignments, and GATK version 1.0.5506 was used for local realignment and base recalibration.

SNP and Indel detection

The UnifiedGenotyper in GATK version 1.0.5506 was used for SNP and Indel detection, with call confidence set to 30.0 and emit confidence set to 10.0. The Dindel model was enabled for Indel calling. Filter labels were applied using the VariantFiltration program in GATK to variants with allele balance (AB) greater than 0.75, quality score (QUAL) less than 50.0, depth of coverage (DP) greater than 360, strand bias (SB) greater than -0.1, or mapping-quality-zero reads (MQ0) greater than or equal to 4. The mpileup function in SAMtools/BCFtools version 0.1.14 was also used for SNP and Indel detection.
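The filter thresholds can be expressed as a simple predicate. The sketch below is illustrative only and is not the VariantFiltration implementation. The function name and filter labels are hypothetical, and it assumes that a missing annotation does not trigger the corresponding filter.

```python
from typing import Dict, List


def failed_filters(qual: float, info: Dict[str, float]) -> List[str]:
    """Return hypothetical labels for every threshold a variant violates.

    Assumption: an annotation absent from `info` never triggers its filter.
    """
    failed = []
    if "AB" in info and info["AB"] > 0.75:    # allele balance
        failed.append("HighAB")
    if qual < 50.0:                           # variant quality score
        failed.append("LowQual")
    if "DP" in info and info["DP"] > 360:     # depth of coverage
        failed.append("HighDP")
    if "SB" in info and info["SB"] > -0.1:    # strand bias
        failed.append("StrandBias")
    if "MQ0" in info and info["MQ0"] >= 4:    # mapping-quality-zero reads
        failed.append("HighMQ0")
    return failed


if __name__ == "__main__":
    variant_info = {"AB": 0.80, "DP": 120, "SB": -12.5, "MQ0": 0}
    print(failed_filters(62.0, variant_info))
```

A variant with an empty result passes all five criteria. The criteria are joined by "or", so any one violation is enough to label the variant.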

SV and CNV detection

BreakDancer version 1.1 was used for paired-end mapping. Pindel version 0.2.2 was used for split-read analysis. Calls from BreakDancer were supplied to Pindel to increase sensitivity and specificity. CNVnator version 0.2.2 was used for read-depth analysis. BreakSeq version 1.3 was modified as described in the Results section to support BAM input in junction mapping. Only SV and CNV calls of at least 50 bp were selected for final output.
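The size selection can be sketched as follows. The tuple layout and function name are hypothetical, and the sketch assumes 0-based, half-open coordinates so that the call length is the end minus the start.

```python
from typing import List, Tuple

# Hypothetical call record: (chromosome, start, end, type).
Call = Tuple[str, int, int, str]

MIN_SIZE_BP = 50  # size threshold stated in the text


def select_calls(calls: List[Call], min_size: int = MIN_SIZE_BP) -> List[Call]:
    """Keep calls whose length (end - start) is at least min_size."""
    return [call for call in calls if call[2] - call[1] >= min_size]


if __name__ == "__main__":
    example = [("chr1", 1000, 1049, "DEL"), ("chr1", 5000, 5050, "DEL")]
    print(select_calls(example))
```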

Variant merging

VCFs generated for SNP and Indel calls were concatenated and merged using VCFtools version 0.1.5. VCFs were indexed using Tabix version 0.2.4. Outputs from SV and CNV detection were converted into GFF using custom shell scripts, and calls with 50% reciprocal overlap were merged using BEDtools version 2.12.0.
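The reciprocal overlap criterion requires that the shared region cover the stated fraction of both calls, not just one of them. The sketch below illustrates the test for two intervals on the same chromosome. It assumes half-open coordinates and treats 50% as a minimum, and the function name is hypothetical rather than part of BEDtools.

```python
def reciprocal_overlap(a_start: int, a_end: int,
                       b_start: int, b_end: int,
                       min_frac: float = 0.5) -> bool:
    """Return True if the overlap covers at least min_frac of BOTH intervals.

    Assumption: intervals are half-open [start, end) on one chromosome.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False  # intervals do not intersect
    a_len = a_end - a_start
    b_len = b_end - b_start
    return overlap >= min_frac * a_len and overlap >= min_frac * b_len


if __name__ == "__main__":
    # 50 bp shared between two 100 bp calls: 50% of each, so they qualify.
    print(reciprocal_overlap(100, 200, 150, 250))
    # The same 50 bp is only 20% of a 250 bp call, so the pair does not qualify.
    print(reciprocal_overlap(100, 200, 150, 400))
```

The two-sided requirement prevents a small call from being merged into a much larger call that happens to contain it.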

Functional annotation

ANNOVAR version 20110506 was used for functional annotation of variants. The UCSC known genes and repeat masker databases were used for gene and repeat annotations, respectively. SIFT scores were taken from the SIFT database. SNP annotation was based on dbSNP version 132.