Documentation‎ > ‎

Advanced

Parameter Configuration


Alignment

BWA algorithm

Script: aln_sai.sh
Line 11: q=INT [default=30]

The option -q sets the value for read trimming. BWA trims a read down to: argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length.

A more complete set of options for BWA can be found in its project page.


SNP/Indel Calling

(1) GATK algorithm

Script: var_snp.sh
Line 38: -dcov INT [default=1000]
Line 43: -stand_call_conf INT [default=30.0]
Line 44: -stand_emit_conf INT [default=10.0]

Script: var_indel.sh
Line 42: -stand_call_conf INT [default=30.0]
Line 43: -stand_emit_conf INT [default=10.0]

The option -dcov defines the maximum coverage at a locus for any given sample. 

The option -stand_call_conf sets the minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with confidence >= INT are emitted as called sites.

The option -stand_emit_conf sets the minimum phred-scaled Qscore threshold to emit low confidence calls. Genotypes with confidence >= INT but less than the calling threshold are emitted but marked as filtered.

A more complete set of options for GATK can be found in its project page


(2) filtering for GATK called SNP/Indels

Script: var_filter.sh
Line 34: filter="(AB ?: 0) > 0.75 || QUAL < 50.0 || DP > 360 || SB > -0.1 || MQ0>=4"
Line 22: --clusterWindowSize INT [default=10]

Among the parameters for filter, AB sets the allele balance for hets (ref/(ref+alt); QUAL sets the mapping quality score; DP sets the values for total depth; SB sets the value for strand bias; and MQ0 sets the total number of mapping quality zero reads. Mapped reads fulfilling the set criteria will be marked as 'filter' in the output VCF file.

The option --clusterWindowSize sets the window size in bases in which to evaluate clustered SNP/Indels. To disable the clustered SNP filter, set this value to less than 1.


(3) SAMtools algorithm

Script: var_pileup.sh
Line 34: -DINT [default=4000]
The option -D sets the maximum read depth per sample to call a SNP.

A more complete set of options for SAMtools can be found in its project page.


Read-pair mapping

BreakDancer algorithm

Script: var_sv_rpm.sh
Line 39: minsize=INT [default=50]

The minsize variable sets the minimal size of variants to be reported in the GFF output. The -l in BreakDancer can be used to enable Illumina's long mate-pair library analysis.

A more complete set of options for BreakDancer can be found in its project page.


Read-depth analysis

CNVnator algorithm

Script: var_sv_rda.sh
Line 13: binsize=INT [default=100]
Line 57: minsize=INT [default=50]

The binsize option sets the bin size window in base pairs for partitioning of the RD signal with the mean-shift approach.

The minsize variable sets the minimal size of variants to be reported in the GFF output.

A more complete set of options for CNVnator can be found in its project page.


Split-read analysis

Pindel algorithm

Script: var_sv_sra.sh
Line 47: isize=INT [default=300]
Line 59: $bOpt
Line 65: 
minsize=INT [default=50]

The isize value: to determine isize, the program first checks if BreakDancer has already calculated the isize; if not then checks if the isize is contained in the header line "PI" in BAM file(s). If still not users can set isize by giving the INT value.

The $bOpt (-b) option: Pindel is able to use calls from other SV methods such as BreakDancer to further increase sensitivity and specificity. BreakDancer result or calls from any methods must in the format: ChrA LocA stringA ChrB LocB stringB other. If the BreakDancer output already exists, with the -b option Pindel uses the result from BreakDancer to improve its calls.

The minsize variable sets the minimal size of variants to be reported in the GFF output.

A more complete set of options for Pindel can be found in its project page.


Junction mapping 

BreakSeq algorithm

Script: var_sv_jct.sh

A more complete set of options for BreakSeq can be found in its project page.


Modulization

HugeSeq is flexible in integrating other variant call algorithms than the current ones. Users can provide a separate script specifying the command lines and parameters for the program of interest, and change the relevant settings in  hugeseq.py and the module file. Note that the input file for variant calls are by default in BAM format, and the output file are by default as GFF format. 

The following provides an example of replacing the CNVnator module with the RDXplorer module for variant calls, both of which utilize read depth coverage.


RDXplorer project page: http://rdxplorer.sourceforge.net/

(1) Download and install RDXplorer.

(2) Add paths and env for RDXplorer in $app_dir/HugeSeq/modulefiles/hugeseq/1.0:
Under "# Setting paths and env for variant detection annotation tools" add: 
setenv RDXPLORER $app_dir/rdxplorer

(3) According to the README file of RDXplorer, users can run RDXplorer by running the script run.sh, which reads in other scripts. Therefore copy these scripts from the RDXplorer package to $app_dir/HugeSeq/v1.0/bin. For better readability, users can rename the run.sh script as: var_sv_rda_RDXplorer.sh

(4) Change the file paths in run.sh as requested by the RDXplorer. Users can choose to tune the parameters for variant calls as specified by the RDXplorer.

(5) The following lines defines the program for RDA analysis in $app_dir/HugeSeq/v1.0/bin/hugeseq.py

Line 175     if (variants is None or "RDA" in variants):
Line 176         job=Job('var_sv_rda-%s'%idprefix)
Line 177         job.output=File(output+".rda.gff")
Line 178         jobs2.append(job.append('var_sv_rda.sh %s %s'%(job.output,input)).depend(*pjobs))

In line 178, substitute the current script name "var_sv_rda.sh" with the new script name "var_sv_rda_RDXplorer.sh".

(6) Now the RDXplorer is incorporated into HugeSeq. For downstream integration, the RDXplorer output should be converted to GFF format. Users can refer to the codes in HugeSeq, e.g. lines 57-62 in var_sv_rda.sh.


Users can add more variant call modules by providing scripts and iterate the above (1) - (6) steps.  For new modules that can be classified into one or more of the following variant call types: SNP, INDEL, SRA, RPM, RDA, and JCT, users will just need to add a few lines as the lines 175-178 shown in the above example to $app_dir/HugeSeq/v1.0/bin/hugeseq.py, under the "def __callvars(jobs, idprefix, output, inputs, pjobs, variants)" section. 

If a new brand strategy is incorporated, users can expand the help=' ' list in:

 $app_dir/HugeSeq/v1.0/bin/hugeseq.py
Line 23:    parser.add_argument('-v', '--variants', metavar='TYPE', nargs="+", help='SNP, INDEL, SRA, RPM, RDA, JCT (default to all)')

Then add the corresponding lines as the lines 175-178 shown in the above example to $app_dir/HugeSeq/v1.0/bin/hugeseq.py, under the "def __callvars(jobs, idprefix, output, inputs, pjobs, variants)" section. 

Comments