Formats

Input Format

HugeSeq takes sequence reads in either FASTA or FASTQ format. These reads are usually generated by high-throughput DNA sequencing platforms, or can be converted from a platform's raw output by a vendor-provided program.

Because sequence data files are very large, HugeSeq also accepts GZIP-compressed input.
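As an illustration of handling plain and GZIP-compressed input through one code path, here is a minimal Python sketch. It is not HugeSeq's own implementation: the function name open_reads is hypothetical, and detecting compression from the two-byte GZIP signature (rather than from the file extension) is an assumption made for this example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every GZIP stream


def open_reads(path):
    """Open a plain or GZIP-compressed reads file as text.

    Hypothetical helper: compression is detected from the file's
    leading bytes, so the file name does not need to end in .gz.
    """
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

The returned handle can be iterated line by line in both cases, so downstream FASTA or FASTQ parsing does not need to know whether the file was compressed.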

Example of a FASTA:

>sequence1
tttttttttttttttttttttgagacagagttttgctctt
gtctcccaggctagagtgcagctgcatgatctcagctcac
tgtagcctctgcctcccgggttaaagctattctcgtgctt
caacctcccaagtagctgggactacaatcgtgcaccacca
>sequence2
agtagctgggattacaggcatgcaccaccatgcctggcta
atttttttgtattttttgtagagacagggtttcaccatgt
tggtcaggctggtctcgaactcctgaccttgggtgatctt
cctgcttcggcctcccaaagtgctgggattacaggcgtga
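The FASTA layout above (a header line starting with ">" followed by one or more sequence lines) can be read with a short Python sketch like the following. The function name read_fasta is hypothetical and this is not how HugeSeq itself parses its input; it only illustrates the record structure.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Each yielded sequence is the concatenation of all lines under its header, so the wrapped 40-character lines in the example are joined into one string per record.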

Example of a FASTQ:

@HS2000-192_107:1:1:19725:3097/1
CTGCAGACCTCAGCAGTGAGGCCAGGAGAGCACGCAGACCACCAGCAGGTGAAGGACAGCGC
+
GGGGGGGGGGGGGGGGGFGGFGGGGGEEEGGGGFGGGGGGGGGGGEGGGEFDFDFBFFEFE?
@HS2000-192_107:1:1:19581:3114/1
AAAGATAGAACTTGGGGACCACTCATTTCTAAAAGTGGCAAGTAGATGGGAATCTAATTAAA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGE
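Each FASTQ record above spans four lines: an "@" header, the bases, a "+" separator, and one quality character per base. The Python sketch below reads such records and decodes the quality string. The names read_fastq and phred_scores are hypothetical, and the ASCII offset of 33 is an assumption; this page does not state which quality encoding the example uses.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        # Read the remaining three lines; a truncated file yields "".
        seq, plus, qual = (next(it, "").rstrip("\r\n") for _ in range(3))
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: %r" % header)
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string; offset=33 is an assumed encoding."""
    return [ord(c) - offset for c in qual]
```

Under the assumed offset, "G" decodes to 38 and "?" to 30; pass a different offset if the data uses another encoding.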

Output Format

The resulting SNP and indel call sets (e.g. from GATK and SAMtools/BCFtools) are in standard VCF format.

Example of a VCF:

##fileformat=VCFv4.0
#CHROM  POS     ID              REF     ALT     QUAL    FILTER  INFO                FORMAT  SAMPLE
chr1    12783   rs62635284      G       A       57.33   PASS    DB;DP=3;            GT      1/1
chr1    13273   .               G       C       43.15   filter  AB=0.67;DP=6;       GT      0/1
chr1    13896   rs74583755      C       A       20.03   LowQual AB=0.77;DB;DP=13;   GT      0/1
chr1    14673   rs11582131      G       C       48.15   filter  AB=0.33;DB;DP=3;    GT      0/1
chr1    14907   rs6682375       A       G       62.22   PASS    AB=0.67;DB;DP=9;    GT      0/1
chr1    14930   rs6682385       A       G       96.17   PASS    AB=0.60;DB;DP=10;   GT      0/1
chr1    15190   rs71230572      G       A       21.14   LowQual AB=0.50;DB;DP=6;    GT      0/1
chr1    15211   rs11586607      T       G       34.36   filter  AB=0.50;DB;DP=6;    GT      0/1
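To show how the columns of a VCF data line map to fields, here is a minimal Python sketch. The function names parse_info and parse_vcf_line are hypothetical, and the sketch assumes tab-delimited columns as in real VCF files (the listing above is aligned with spaces for display only). It handles a single sample column, as in the example, and is not a full VCF parser.

```python
def parse_info(info):
    """Turn an INFO string such as 'AB=0.67;DB;DP=9;' into a dict."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # skip empty items left by a trailing ';'
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # flags such as DB have no value
    return out


def parse_vcf_line(line):
    """Parse one tab-delimited VCF data line with one sample column."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt,
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": parse_info(info),
    }
    if len(fields) > 9:
        # FORMAT keys and sample values are both ':'-separated.
        record["sample"] = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return record
```

Lines beginning with "#" are headers and should be skipped before calling parse_vcf_line. Keeping only records whose "filter" value is "PASS" would select the calls that passed filtering in the example above.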

Because SV and CNV callers typically report variant calls in their own formats, HugeSeq standardizes their outputs by converting them to the General Feature Format (GFF), which represents genomic intervals.

Example of a GFF:

#chr    source          event           start   end     score   strand  frame   attributes
chr1    CNVnator        Deletion        1       10000   .       .       .       Size 10000;RD 0
chr1    CNVnator        Duplication     11601   15800   .       .       .       Size 4200;RD 1.69
chr1    CNVnator        Deletion        36701   48000   .       .       .       Size 11300;RD 0.72
chr1    CNVnator        Deletion        65901   101200  .       .       .       Size 35300;RD 0.73
chr1    CNVnator        Duplication     107001  111200  .       .       .       Size 4200;RD 1.50
chr1    CNVnator        Deletion        168301  168800  .       .       .       Size 500;RD 0.36
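The nine GFF columns above can be read with a short Python sketch such as the following. The function name parse_gff_line is hypothetical, and the sketch assumes tab-delimited columns (the listing above is aligned with spaces for display) and "key value" attribute pairs separated by ";", as in the CNVnator rows shown.

```python
def parse_gff_line(line):
    """Parse one tab-delimited GFF line into a dict."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    chrom, source, event, start, end, score, strand, frame, attrs = fields
    attributes = {}
    for item in attrs.split(";"):
        item = item.strip()
        if item:
            # Attributes look like 'Size 4200': key, space, value.
            key, _, value = item.partition(" ")
            attributes[key] = value
    return {
        "chrom": chrom,
        "source": source,
        "event": event,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }
```

In the example rows, start and end are inclusive, so end - start + 1 matches the Size attribute (for instance, 15800 - 11601 + 1 = 4200).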