The VCF format is still under development as part of the 1000 Genomes project but seems to be emerging as the most widely supported standard for storing information about variants within mapped sequence data. Version 4.1 is the latest draft version in development and discussed extensively here. The file format is quite data-rich and fairly complex.

Additional sources of information:

In particular, the various Qual and Likelihood fields can be seriously tricky, most are PHRED scaled (-10log10()), but not always interpreted the same, for example:

  • QUAL –> -10log_10 prob(call in ALT is wrong) so, bigger is MORE confident
  • GQ –> -10log_10p(genotype call is wrong, conditioned on the site’s being variant) so again, bigger is MORE confident
  • PL –> L(data given that the true genotype is X/Y) so, bigger is LESS confident

Also confusing is the way in which the PL likelihoods are arranged, consider an entry like:

NC_015433.1_Ss_ST3 2721 . G A 222 .
DP=68;VDB=0.0002;AF1=1;AC1=2;DP4=0,0,24,44;MQ=32;FQ=-232 GT:PL:DP:GQ

In this case, the REF is G, the ALT is A, the genotype call is 1/1, and the PL is 255,205,0. These correspond to genotypes: GG:255, GA:205, AA:0. Since 0 is the smallest, it is the most likely given the data.

Things are more tricky in a triallelic case:

NC_015433.1_Ss_ST3 147001 0 A C,G 53.7 0 INDEL;DP=53;VDB=0.0446;AF1=1;AC1=2;DP
4=17,9,15,8;MQ=25;FQ=-50.5;PV4=1,1,1,0.00012 GT:PL:DP:GQ  
SAMPLE1 = 1/1:255,177,161,144,0, 122:49:29

It is important to note that although the VCF specs call for assigning different numbers to each variant (0=A, 1=C, 2=G in this case), SAMtools breaks this requirement and assigns everything that isn’t REF to 1. So, if we consider the PL scores in this case (for triallelic the pattern is RR,RA1,A1A1,RA2,A1A2,A2A2 where R=reference, A1=alt1, A2=alt2).

AA:255, AC:177, CC:161, AG:144, CG:0 GG:122. Since CG (A1A2) has the smallest, this is the actual call. Despite generating the likelihoods of the genotypes given the reads, SAMtools mpileup still calls it 1/1 instead of 1/2 as it should.

-10log_10 prob(call in ALT is wrong)