基于illumina测序数据,使用Jellyfish估计基因组大小

一、数据

① illumina测序数据:(细菌Staphylococcus aureus,基因组约3M,180bp文库,101PE测序,1294104条reads,测序碱基数据量约为130Mb,即基因组覆盖度45X。)

http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz

http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_2.fastq.gz

②  jellyfish 软件:(用来统计kmer频数。)

ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.4.tar.gz

③jellyplot.pl软件:(用来画kmer图,以及估计基因组大小。)

https://github.com/josephryan/estimate_genome_size.pl

④fastx toolkit:(用来转换fastq和fasta格式文件。)

hannonlab.cshl.edu/fastx_toolkit/

 

——————

下面是实际运行代码:

cat frag_1.fastq frag_2.fastq > all.fastq
fastq_to_fasta -i all.fastq -o all.fasta
jellyfish count -m 17 -s 100M -t 10 -C all.fasta
jellyfish histo mer_counts.jf -o out.2.txt
jellyplot.pl out.2.txt
estimate_genome_size.pl -kmer 17 -peak 15 -fastq all.fastq.gz

 

结果:(预估基因组为7M,比实际大了一倍。可能是原始reads没有校正和过滤的原因。)
$ estimate_genome_size.pl -kmer 17 -peak 15 -fastq all.fastq.gz
# running version 0.04 of estimate_genome_size.pl
# run with this command: estimate_genome_size.pl

TOTAL_NTS: 131998608
Estimated Coverage: 17.7906976744186
Estimated Genome Size: 7419529.6

QQ截圖20141217130925

使用校正之后的数据,(使用quake.py进行校正http://gage.cbcb.umd.edu/data/Staphylococcus_aureus)

估计基因组大小为:

# running version 0.04 of estimate_genome_size.pl
# run with this command: /home/fenglei/perl5/bin/estimate_genome_size.pl

TOTAL_NTS: 59609610
Estimated Coverage: 15.3684210526316
Estimated Genome Size: 3878707.5

这次距离实际情况比较近。

 

 

===================

附:jellyfish 参数意义

jellyfish count -m 17 -s 100M -t 10 -C all.fasta

-m是kmer长度;

-s是预估哈希表的大小,即G+G*c*e*k。G是Genome Size;c是coverage(genome survey测序通常低于100x);e是测序错误率(illumina为1%);k是kmer大小。

-C表示考虑DNA正义与反义链,遇到反义kmer时,计入正义kmer频数中。

关于

Bioinformatics

发表在 Bioinformatics

留下评论

Note
Learning & Sharing.