ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Tutorial: Difference between revisions

From angsd
Jump to navigation Jump to search
(Replaced content with "<pre> This page might be outdated, please see the 'Quick Start' shown in sidebar. </pre>")
 
(36 intermediate revisions by the same user not shown)
Line 1: Line 1:
=Prepare files=
angsd can work on remote bam files therefore first download a list with 23 unrelated europeans from the 1000genomes project.
<pre>
<pre>
wget http://popgen.dk/netstuff/files.list
This page might be outdated, please see the 'Quick Start' shown in sidebar.
</pre>
</pre>
<div class="toccolours mw-collapsible mw-collapsed">
Contents of the file 'files.list'
<pre class="mw-collapsible-content">
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12003/alignment/NA12003.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12004/alignment/NA12004.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12005/alignment/NA12005.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12006/alignment/NA12006.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11831/alignment/NA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11832/alignment/NA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11992/alignment/NA11992.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11993/alignment/NA11993.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11994/alignment/NA11994.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11995/alignment/NA11995.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12154/alignment/NA12154.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12155/alignment/NA12155.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12156/alignment/NA12156.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06994/alignment/NA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11840/alignment/NA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12044/alignment/NA12044.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12813/alignment/NA12813.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12760/alignment/NA12760.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12762/alignment/NA12762.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06985/alignment/NA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA11881/alignment/NA11881.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12249/alignment/NA12249.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
</pre>
</div>
==Understaing angsd options==
As a simple reference for the program we have made most of the methods within angsd easy viewable by writing the associated command.
All options are given by
<pre>
-parameter value
</pre>
It's important that there are no space between the dash and the paramater, it is important that there are a space betwwen the parameter and the value. Futhermore the parameter is casesensitive.
Simply writing angsd will give you the helpscreen.
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd
<pre class="mw-collapsible-content">
Command:
angsd
-> angsd version: 0.515 build(Mar 23 2013 12:35:04)
-> Please use the website "http://www.popgen.dk/angsd" as reference
-> Use -nThreads for number of threads allocated to the program
Overview of methods:
-GL estimate genotype likelihoods
-doCounts Calculate various counts statistics
-doDepth Do depth statistics
-doAsso perform association study
-doMaf estimate allele frequencies
-doError estimate the type specific error rates
-doAnsError estimate the errorrate based on perfect fastas
-doHWE Est inbreedning per site
-doGeno call genotypes
-realSFS Estimate the SFS and/or perform neutrality tests
Below are options that can be usefull
-bam Options relating to bam reading
-doMajorMinor Infer the major/minor using different approaches
-ref/-anc Read reference or ancestral genome
many others
For information of specific options type:
./angsd METHODNAME eg
./angsd -GL
./angsd -doMaf
./angsd -doAsso etc
Examples:
Estimate MAF for bam files in 'list'
'./angsd -bam list -GL 2 -doMaf 2 -out RES -doMajorMinor 1'
</pre>
</div>
An explanation for every parameter is shown beside the parameter, and for every of these options we can get additional information by typing that parameter solely without any options. An example below for the methods relating to genotype likelihood calculation.
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd -GL
<pre class="mw-collapsible-content">
-> angsd version: 0.515 build(Mar 27 2013 11:15:58)
-> Analysis helpbox/synopsis information:
---------------------
analysisEstLikes.cpp:
-GL=0:
1: SAMtools
2: GATK
3: SOAPsnp
4: SYK
-minQ 13 (remove bases with qscore<minQ)
-trim 0 (zero means no trimming)
-tmpdir angsd_tmpdir/ (used by SOAPsnp)
-errors (null) (used by SYK)
-minInd -1 (-1 indicates no filtering)
Filedumping:
-doGlf=0
1: binary glf (10 log likes) .glf
2: beagle likelihood file .beagle.gz
3: binary 3 times likelihood .glf
4: text version (10 log likes) .glf
</pre>
</div>
This tells you that there are 4 different genotype likelihood models implemented and you can choose accordingly by writing -GL 1 for the SAMtools model. We also see that we can dump the genotype likelihoods in four different ways.
=Getting simple Counts/depth=
For some analysis simply getting the sequencing depth for all sites could be of interest, this can of analysis is grouped in the '-doCounts' methods.
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd -doCounts
<pre class="mw-collapsible-content">
</pre>
</div>
So if we wanted the sum of ACGTS across all samples we could write
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd -bam files.list -doCounts 1 -dumpCounts 3 -out tstCounts -nInd 10
<pre class="mw-collapsible-content">
---------------
head -n tstCounts.counts
totA totC totG toft
0 0 1 0
0 1 0 0
1 0 0 0
1 0 0 0
2 0 0 0
0 0 0 2
0 0 0 2
0 0 0 2
0 0 2 0
</pre>
</div>
Or if we wanted the sequencing depth per sample but only for the good quality data
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd -bam files.list -doCounts 1 -dumpCounts 2 -out tstCounts -nInd 10 -minQ 20 -minMapQ 30
<pre class="mw-collapsible-content">
---------------
head tstCounts.counts
ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0
</pre>
</div>
=Frequencies=
We can also estimate the allele frequencies. This we do by using the -doMaf option.
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd -doMaf
<pre class="mw-collapsible-content">
analysisMaf.cpp:
-doMaf 0
1: BFGS frequency (known major minor)
2: EM frequency (known major minor)
4: BFGS frequency (unknown major minor)
8: EM frequency (unknown major minor)
16: frequency from genotype probabilities
32: alleleCounts based method (known major minor)
-doSNP 0
-minMaf 0.010000 0
-minLRT 24.000000 0
-ref (null)
-anc (null)
-doZ 0
-eps 0.001000 [Only used for -doMaf &32]
</pre>
</div>
So if we try to use -doMaf 2 angsd will complain!
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd0.530/angsd -bam files.list -doMaf 2 -out tstMaf
<pre class="mw-collapsible-content">
Error: you need to specify doMajorMinor in order to doMaf
</pre>
</div>
So lets decide also estimate the major and minor, what are the options.
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd -doMajorMinor
<pre class="mw-collapsible-content">
-------------------
analysisMajorMinor.cpp:
-doMajorMinor 0
1: Infer major and minor from GL
2: Infer major and minor from allele counts
3: use major and minor from bim file (requires -filter afile.bim)
4: Use reference allele as major (requires -ref)
5: Use ancestral allele as major (requires -anc)
</pre>
</div>
Let us infer the major and minor using the genotype likelihoods.
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd0.530/angsd -bam files.list -doMaf 2 -out tstMaf -doMajorMinor 1
<pre class="mw-collapsible-content">
-doMajorMinor 1 is based on genotype likelihoods, you must specify a genotype likelihood model -GL
</pre>
</div>
So now we need to specify which genotype likelihood model we want to use, let us see what our options are
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd0.530/angsd -GL
<pre class="mw-collapsible-content">
---------------------
analysisEstLikes.cpp:
-GL=0:
1: SAMtools
2: GATK
3: SOAPsnp
4: SYK
-minQ 13 (remove bases with qscore<minQ)
-trim 0 (zero means no trimming)
-tmpdir angsd_tmpdir/ (used by SOAPsnp)
-errors (null) (used by SYK)
-minInd -1 (-1 indicates no filtering)
Filedumping:
-doGlf 0
1: binary glf (10 log likes) .glf
2: beagle likelihood file .beagle.gz
3: binary 3 times likelihood .glf
4: text version (10 log likes) .glf
</pre>
</div>
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd0.530/angsd -bam files.list -doMaf 2 -out tstMaf -doMajorMinor 1 -GL 1 -nInd 20
<pre class="mw-collapsible-content">
head tstMaf.mafs
chromo position major minor knownEM nInd
chr1 11782 G A 0.000008 1
chr1 11783 C A 0.000008 1
chr1 11784 A C 0.000008 1
chr1 11785 A C 0.000008 2
chr1 11786 A C 0.000008 3
chr1 11787 T A 0.000008 3
chr1 11788 T A 0.000008 3
chr1 11789 T A 0.000008 4
chr1 11790 G A 0.000008 4
</pre>
</div>
These sites are all invariable so lets filter out the sites with a maf below 0.5%
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd0.530/angsd -bam files.list -doMaf 2 -out tstMaf -doMajorMinor 1 -GL 1 -nInd 20 -minMaf 0.005 -nThreads 10
<pre class="mw-collapsible-content">
head tstMaf.mafs
chromo position major minor knownEM nInd
chr1 13032 A T 0.016541 9
chr1 13038 T C 0.073424 9
chr1 13309 G T 0.207854 4
chr1 13396 T A 0.012727 16
chr1 13482 G C 0.019104 15
chr1 13502 G A 0.025732 14
chr1 13519 T C 0.018256 17
chr1 14933 A G 0.487768 1
chr1 16259 T G 0.060360 7
</pre>
</div>
The filters works across the different analysis classes, so if we supply the dumpCounts we will only get the sites with a maf >0.5%
<div class="toccolours mw-collapsible mw-collapsed" >
./angsd0.530/angsd -bam files.list -doMaf 2 -out tstMaf -doMajorMinor 1 -GL 1 -nInd 20 -minMaf 0.005 -nThreads 10 -doCounts 1 -dumpCounts 3
<pre class="mw-collapsible-content">
head tstMaf.mafs
paste tstMaf.pos tstMaf.counts |head
chr pos totDepth totA totC totG totT
chr1 13032 21 20 0 0 1
chr1 13038 21 0 1 0 20
chr1 13309 8 0 0 7 1
chr1 13396 34 1 0 0 33
chr1 13482 30 0 1 29 0
chr1 13502 25 1 0 24 0
chr1 13519 26 0 1 0 25
chr1 14933 2 1 0 1 0
chr1 16259 9 0 0 1 8
</pre>
</div>

Latest revision as of 01:31, 10 January 2014

This page might be outdated, please see the 'Quick Start' shown in sidebar.