Technical Note

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Ruibang Luo (1,2), Binghang Liu (1,2), Yinlong Xie (1,2,3), Zhenyu Li (1,2), Weihua Huang (1), Jianying Yuan (1), Guangzhu He (1), Yanxiang Chen (1), Qi Pan (1), Yunjie Liu (1), Jingbo Tang (1), Gengxiong Wu (1), Hao Zhang (1), Yujian Shi (1), Yong Liu (1), Chang Yu (1), Bo Wang (1), Yao Lu (1), Changlei Han (1), David W Cheung (2), Siu-Ming Yiu (2), Shaoliang Peng (4), Zhu Xiaoqian (4), Guangming Liu (4), Xiangke Liao (4), Yingrui Li (1,2), Huanming Yang (1), Jian Wang (1), Tak-Wah Lam (2)* and Jun Wang (1)*

Author Affiliations

1 BGI HK Research Institute, 16 Dai Fu Street, Tai Po Industrial Estate, Hong Kong

2 HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory & Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong

3 School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, 510006, China

4 School of Computer Science, National University of Defense Technology, No.47, Yanwachi street, Kaifu District, Changsha, Hunan, 410073, China

GigaScience 2012, 1:18  doi:10.1186/2047-217X-1-18

Published: 27 December 2012



The number of de novo genome assemblies built from next-generation sequencing (NGS) short reads is increasing rapidly; however, several major challenges must still be overcome before such assembly is efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.


To overcome these challenges, we have developed its successor, SOAPdenovo2, whose new algorithm design reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.


Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers in both assembly length and accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome produced with SOAPdenovo2. The contig and scaffold N50 of the new YH assembly were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and 50-fold longer than those of the first published version. Genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.
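The contig and scaffold N50 values quoted above follow the standard definition: the length L such that sequences of length L or longer together account for at least half of the total assembly length. As a minimal sketch of that definition (not the evaluation code used for these benchmarks), the Python function below computes N50 from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the YH assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths (in bp), chosen only for illustration.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # total = 300; 80 + 70 = 150 reaches half, so N50 = 70
```

Because N50 depends only on the length distribution, a longer N50 indicates better continuity but says nothing about correctness, which is why the benchmarks report accuracy alongside length.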

Keywords: Genome; Assembly; Contig; Scaffold; Error correction; Gap-filling