SAMtools

SAMtools is a set of tools for manipulating files in SAM (Sequence Alignment/Map) format.

A typical application is to call variants based on differences in reads and a reference genome or reference contigs.

Sample workflow using bwa for read mapping:

Generate a bwa index of reference:

bwa index ./ref_index/reference.fasta

Consider modifying the reference so that entry name(s) is concise and has no spaces (_ and - seem to work fine). This is helpful, both because the results are easier to fit on a screen, and because bwa and samtools handle spaces and possibly other special characters in the name differently. This can result in confusing errors later in the process.

Map reads to reference and generate a gzipped “sam” file:

bwa aln -I -t 3 ./ref_index/reference.fasta lane1_trimmed.fastq >
lane1.aln.sai  
bwa samse ./ref_index/reference.fasta lane1.aln.sai lane1_trimmed.fastq | gzip
> lane1.sam.gz

Make sure to double check whether quality scores are Illumina (Phred+64) or Sanger (Phred+33). More recent versions of bwa support the Illumina encoding with the -I switch.
The bwa program has different modes for single-end (samse), paired end (sampe) and long reads (bwasw). Check the iBEST knowledgebase for tips, or the manual for more detailed information.

Sort the sam file so that reads are organized with respect to their mapping position and convert to bam:

samtools view -uS lane1.sam.gz | samtools sort - lane1

Generate a report on mapping quality using samstat:

samstat lane1.bam

Calculate coverage and generate an “mpileup” formatted file (-B=no BAQ computation -Q=minimum qual 0, -d=max coverage depth):

samtools mpileup -BQ0 -d10000000 -f ****./ref_index/reference.fasta lane1.bam > lane1_coverage.mpileup

samtools mpileup can handle one or many bam files. A full description of the command line switches and usage is available on the SAMtools manual page. This list is also available by typing “samtools mpileup” with no additional parameters.

SAMtools can also generate output in VCF/BCF format (Variant Call Format /Binary-variant Call Format). These can be further manipulated with bcftools.

Generate VCF report for ALL positions, with per-sample coverage and output to a vcf (-D=per sample depth, -u=uncompressed BCF):

samtools mpileup -D -B -u -f ****./ref_index/reference.fasta lane1.bam | bcftools view -gc - > lane1_unfiltered_all_positions.vcf

The output (lane1_unfiltered_variants.vcf) would then need to be filtered by the user. The data is in tabular format making R is an excellent choice for this task, however the VCF format van be a little intimidating so check the s pecifications before trying to decoding the various fields.

Alternatively you can use some of the more advanced features of samtools and bcftools to filter variants. This is purported to reduce false positives, but may also result in a failure to report some real SNPs.

samtools mpileup -uf ./ref_index/reference.fasta lane1.bam | bcftools view -bvcg - > lane1_only-variant-positions.bcf
bcftools view lane1_only-variant-positions.bcf > lane1_only-variant-positions.vcf

Further filtering can be carried out with vcfutils.pl, i.e. for max coverage:

bcftools view lane1_only-variant-positions | vcfutils.pl varFilter -D100 > lane1_variants_max-coverage-100.vcf