This page contains answers to many of the common questions asked about VarScan usage, performance, input/output, etc.
Which version of VarScan should I use?
Which VarScan command should I use?
Can I use VarScan on WGS, exome, or RNA-seq data?
Does VarScan work for pooled samples?
How do I use VarScan for validation?
Can I use VarScan on my model organism or microbial genome?
What input format should I give to VarScan?
Which aligner should I use?
Do I need Illumina or Phred scale base qualities?
What about mapping qualities?
What do the output columns mean?
How are the p-values calculated?
Why are all of my p-values 0.98?
How should I filter the output files?
Does VarScan provide a confidence score similar to SAMtools?
COMMON ISSUES, WARNINGS, AND ERRORS
Warnings about "resetting normal" or "resetting tumor" file
Read counts from VarScan are different from SAMtools/IGV counts
Always use the latest version! Currently, this is v2.3.x. There are differences between versions; notably, between
v2.2.2 and v2.2.3 we adjusted how reads are counted for indels, which had exhibited strange behavior due to some
peculiarities in how SAMtools represents indels in the pileup files.
If you have a single sample and wish to call variants, you can use pileup2snp
to call SNPs, pileup2indel
to call indels,
to call consensus genotypes at every position meeting the coverage requirement.
To call both SNPs and indels simultaneously, use pileup2cns
with the --variants
parameter set to 1.
If you have tumor-normal pairs, you should use the somatic
command to call mutations (SNVs/indels) and the
command to call copy number alterations (CNAs).
Yes, you can use VarScan for any of these. The default settings are optimized for exome data, where one expects to have
at least 10x or 20x coverage across targeted exons. By lowering the minimum coverage requirement (to perhaps 6x) one
can call variants in lower-coverage data. Warning: the output files for WGS data may be quite large, and the runtimes longer,
since it will call 3+ million variants per genome. VarScan works for RNA-seq data as well, though this data tends to be
noisier for variant calling due to RT errors, allele-specific expression, and alignment difficulties.
Absolutely. It's simply a matter of setting appropriate input parameters, particularly
In pooled data, you might specify a higher minimum coverage, a lower variant allele frequency
(conservatively, 0.50 / the number of samples), and a less-stringent p-value to detect rare variants.
If validating germline SNPs/indels on an orthogonal sequencing platform, use the pileup2cns
command, which will call
consensus genotypes at all positions with sufficient coverage, rather than just the variants. This lets you determine
sites that are refuted as wild-type. You might start with the following parameters:
If validating somatic SNPs/indels on an orthogonal sequncing platform, use the somatic
set to 1.
You might start with the following parameters:
Of course. All you need is a BAM file of your reads aligned to a reference sequence. If you don't have a reference sequence
to which you might align reads, then no.
VarScan expects its input in SAMtools pileup format, which is obtained from a BAM file via the samtools pileup command.
samtools pileup -f myReference.fasta myReads.bam >myPileup.pileup
java -jar VarScan.jar pileup2snp myPileup.pileup
To save on disk space and I/O, you can also use a UNIX "pipe" command to forward the pileup output directly into VarScan:
samtools pileup -f myRef.fasta myBam.bam | java -jar VarScan.jar pileup2snp
Great question! The choice of an aligner is an important one, and entire review articles have been written on the topic.
For practical purposes, you should use an aligner whose output is (or can be converted to) a SAM/BAM file. I have heard
that popular aligners for include BWA, Bowtie, and Novoalign for Illumina data, SHRiMP and BFAST for SOLiD data, and
SSAHA2 and BWA-SW for Roche/454 data.
VarScan expects Phred-scaled (Phred+33, also called "Sanger") base qualities, in which a score of 20 indicates a 1/1000
probability of base error. By default, VarScan requires a minimum base quality of 15 or 20 (depending on the application),
so this value should be adjusted appropriately if alternate base qualities are used.
Many aligners provide a Phred-scaled quality value (mapping quality) for every read's alignment, which is correlated to
the probability that the read is correctly mapped. A mapping quality of 0 typically indicates that the given read has
many possible mapping locations of equal probability, and that the location given was chosen randomly. Thus, it's best
to exclude reads with mapping quality of 0 from most downstream analyses. A minimum mapping quality of 10 is even better.
It's possible to apply this threshold to a BAM file using SAMtools, as follows:
samtools view -b -q 10 myBam.bam | samtools pileup -f myRef.fasta -
| java -jar VarScan.jar pileup2snp
By default, all VarScan output files should include headers. For detailed descriptions of the output columns
and their meanings, see the "output" sections of the germline
You can also specify VCF output, which is widely documented.
P-values are calculated using a Fisher's Exact Test on the read counts supporting reference and variant alleles. For details,
see the germline
The p-value calculations are computationally expensive, so if the user doesn't specify a p-value threshold, VarScan
skips the calculation in pileup2snp, pileup2indel, and pileup2cns. Instead, it inserts a dummy value of 0.98. To
get p-values, set the --p-value
parameter to something like 0.10, 0.05, or 0.01.
VarScan's default parameters aim to be sensitive when detecting variants, with the trade-off that it will report
a lot of candidate variants, many of which will be false positives.
These should be further filtered by coverage, variant allele frequency, strand
representation, and p-value to isolate high-confidence calls. When performing the initial discovery, it is recommended
that you set --strand-filter
to 1. After discovery, you should run VarScan filter
to further refine
your variant calls.
It is possible to calculate a Phred-scaled confidence score from the p-value that VarScan provides:
$score = -10 * log10($p_value);
$score = 255 if($score > 255);
This is done on a per-sample basis when --vcf-output
is specified, since VCF format expects Phred-scaled confidence scores.
COMMON ISSUES, WARNINGS, AND ERRORS
If you provide two separate input (pileup) files, VarScan attempts to use the chromosome and position to simultaneously parse both files and match them up. Doing so while accounting
for alphanumeric chromosome names is difficult, especially when there are chromosomes for which only one sample had coverage. VarScan does its best to ensure that no chromosomes
are missed due to a sorting issue by resetting one sample's pileup file, to ensure that no chromosomes are missed due to sorting. When you have many chromosomes or contigs with
coverage in just one sample, you'll see a lot of these warnings.
The best way to address this simultaneous parsing issue is quite simple: provide VarScan with a two-sample MPILEUP file
(normal and tumor) instead of two individual pileup files:
samtools mpileup -f reference.fasta -q 1 -B normal.bam tumor.bam >normal-tumor.mpileup
java -jar VarScan.jar somatic normal-tumor.mpileup normal-tumor.varScan.output --mpileup 1
In the mpileup file, SAMtools already does the chromosome and position matching-up, so there's no room for error.
You'll also notice that I provided the -B parameter. This disables SAMtools BAQ computation, which is turned
on by default in SAMtools mpileup, but (as the author admits) occasionally too stringent for variant calling.
By default, SAMtools and IGV show and count all bases at a given position, regardless of base quality. In contrast, VarScan requires that bases meet the minimum Phred
quality score (default 15 for most commands) to count them for things like read counts (reads1, reads2) and to compute variant allele frequency. However, when VarScan
reports the depth (such as in the DP field of VCF output), it reports SAMtools raw depth. To get VarScan read counts to more closely match another tool, set
use parameter --min-avg-qual 0. And use caution! Low-quality bases, with the occasional exception of BAQ penalties, should not be trusted.
Also, VarScan reports variants on a biallelic basis. That is, for a given SNP call, the "reads1" column is the number of reference-supporting reads (RD), and the "reads2" column is
the number of variant-supporting reads (AD). There may be additional reads at that position showing other bases (SNP or indel variants). If these other variants meet the calling
criteria, they will be reported in their own line. If not, it may look like you have "missing" reads.