一、数据
① illumina测序数据:(细菌Staphylococcus aureus,基因组约3M,180bp文库,101PE测序,1294104条reads,测序碱基数据量约为130Mb,即基因组覆盖度45X。)
http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz
http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_2.fastq.gz
② jellyfish 软件:(用来统计kmer频数。)
ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.4.tar.gz
③jellyplot.pl软件:(用来画kmer图,以及估计基因组大小。)
https://github.com/josephryan/estimate_genome_size.pl
④fastx toolkit:(用来转换fastq和fasta格式文件。)
hannonlab.cshl.edu/fastx_toolkit/
——————
下面是实际运行代码:
cat frag_1.fastq frag_2.fastq > all.fastq
fastq_to_fasta -i all.fastq -o all.fasta
jellyfish count -m 17 -s 100M -t 10 -C all.fasta
jellyfish histo mer_counts.jf -o out.2.txt
jellyplot.pl out.2.txt
estimate_genome_size.pl -kmer 17 -peak 15 -fastq all.fastq.gz
结果:(预估基因组为7M,比实际大了一倍。可能是原始reads没有校正和过滤的原因。)
$ estimate_genome_size.pl -kmer 17 -peak 15 -fastq all.fastq.gz
# running version 0.04 of estimate_genome_size.pl
# run with this command: estimate_genome_size.pl
TOTAL_NTS: 131998608
Estimated Coverage: 17.7906976744186
Estimated Genome Size: 7419529.6
使用校正之后的数据,(使用quake.py进行校正http://gage.cbcb.umd.edu/data/Staphylococcus_aureus)
估计基因组大小为:
# running version 0.04 of estimate_genome_size.pl
# run with this command: /home/fenglei/perl5/bin/estimate_genome_size.pl
TOTAL_NTS: 59609610
Estimated Coverage: 15.3684210526316
Estimated Genome Size: 3878707.5
这次距离实际情况比较近。
===================
附:jellyfish 参数意义
jellyfish count -m 17 -s 100M -t 10 -C all.fasta
-m是kmer长度;
-s是预估哈希表的大小,即G+G*c*e*k。G是Genome Size;c是coverage(genome survey测序通常低于100x);e是测序错误率(illumina为1%);k是kmer大小。
-C表示考虑DNA正义与反义链,遇到反义kmer时,计入正义kmer频数中。
留下评论