ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Input: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
Line 100: Line 100:


;NB and remember to supply a -fai file
;NB and remember to supply a -fai file
=VCF files=
VCF file as input is now included but with some limitiations. Only chr,pos,ref,alt and gp/gl tags are being used, and we discard indels and non diallelic sites.
Futhermore you are required to include a fai file and the number of individuals.
<pre>
#for using GL tags
./angsd -vcf-gl ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two
#for using GP tags
./angsd -vcf-gp ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two
</pre>


=Genotype Probability Files=
=Genotype Probability Files=

Revision as of 10:44, 19 December 2014

ANSGD currently supports various formats.

  1. BAM files
  2. Genotype Likelihood files
  3. Beagle files.
  4. vcf files (will extract alt/ref and gl/gp tags)

Below is a short description of those we believe is of most use.

BAM files

ANGSD accepts bam files for mapped sequences. For imformation on the file specification and file creation see the samtools website [1]. These are required do be sorted according to reference and position.

Brief Overview

./angsd -bam
Command:
./angsd -bam 
	-> angsd version: 0.569	 build(Dec 11 2013 14:47:21)
	-> Analysis helpbox/synopsis information:
---------------
parseArgs_bambi.cpp: bam reader:
	-bam		(null)	Supply a file list of BAMfiles
	-i		(null)	Supply a single BAMfile
	-r		(null)	Supply a single region in commandline (see examples below)
	-rf		(null)	Supply multiple regions in a file (see examples below)
	-remove_bads	1	Discard 'bad' reads, (flag >=255) 
	-nInd		0	Only use first nInd from the filelist from the -bam argument
	-nLines		50	Read nLines from files at a time
	-uniqueOnly	0	Discards reads that doesn't map uniquely
	-show		0	Mimic 'samtools mpileup' also supply -ref fasta for printing reference column
	-minMapQ	0	Discard reads with mapping quality below
	-minQ	13	Discard bases with quality below this value
	-only_proper_pairs	1	Only use reads where the mate could be mapped
	-C		0	adjust mapQ for excessive mismatches (as SAMtools), supply -ref
	-baq		0	adjust qscores around indels (as SAMtools), supply -ref

Examples for region specification:
		chr:		Use entire chromosome: chr
		chr:start-	Use region from start to end of chr
		chr:-stop	Use region from beginning of chromosome: chr to stop
		chr:start-stop	Use region from start to stop from chromosome: chr
		chr:site	Use single site on chromosome: chr

Arguments

-bam [filelist]
-b [filelist]

The filelist is a file containing the full path for each bam file with one filename per row.

Example of a filelist with 6 individuals

/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
/home/software/angsd/test/smallBam/smallNA12004.bam
/home/software/angsd/test/smallBam/smallNA06985.bam
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA12761.bam

Example of estimating allele frequencies from bam files

./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5

Optional arguments

-r [region]

Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples

chr1:1-10000             \\ first 10000 based for chr1
chr2:50000-              \\chr2 but exclude the first 50000 bases
chr11:1-                 \\all of chr11
chr11:                   \\all of chr11
chr7:123456              \\position 123456 of chr7
-only_proper_pairs [int]=0

Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. If your data is not paired end you have to choose 1

-rf [region file]

specify multiple regions in a file.

-nLines [int]=50

Number of lines to read per file at a time. Reducing this number will decrease the RAM usage with a small cost to the speed.

-uniqueOnly [int]=0

remove reads that have multiple best hits.. 0 no (default), 1 remove

-remove_bads [int]=1

Same as the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads)

-minQ [int]=0

minimum base quality

-minMapQ [int]=0

minimum mapQ quality -baq [int] =0 perform baq computation, remember to cite the baq paper for this.

Genotype Likelihood Files

glf

A simple format for genotype likelihoods: Every sample is in seperate files, Every genotype is saved as binary double log10 scaled. in the following order. AA,AC,AG,AT, etc

-glf [filename]
NB and remember to supply a -fai file

VCF files

VCF file as input is now included but with some limitiations. Only chr,pos,ref,alt and gp/gl tags are being used, and we discard indels and non diallelic sites. Futhermore you are required to include a fai file and the number of individuals.

#for using GL tags
./angsd -vcf-gl ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two
#for using GP tags
./angsd -vcf-gp ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two

Genotype Probability Files

Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [2].

NB you are required to supply a fai file. Otherwise the program will give the following warning (this will be changed in future versions).

Brief Overview

./angsd -beagle
Command:
./angsd -beagle 
	-> angsd version: 0.569	 build(Dec 11 2013 17:27:32)
----------------
beagleReader.cpp:
	-beagle	(null)	(Beagle Filename (can be .gz))
	-intName=0	(Assume First column is chr_position)
	Use -chunkSize for defining how many sites to use at a time
	Use -fai to supply a fai file

options

To include a beagle file use the options

-beagle [fileName]

beagle file name. The file must be gzipped.

-intName [int]

default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.

Example

The file format is a single linje per site. The first 3 coloums are

  • markerName
  • alleleA
  • alleleB

For each individual 3 coloums are added. These three colums should sum to one.

Example of a file with two individuals

marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759152 C A 1 0 0 0 1 0
chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029
chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064

Example of estimating allele frequencies from beagle files

./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai

the reference fai can be optained by indexing the reference genome or by using a bam files header

samtools view -H  bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' > ref.fai