Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads

Date: 2022-10-04 23:00:00

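Motivation:

The advent of nanopore sequencing technology challenges existing assembly methods. In this work, we evaluate the performance of existing hybrid and non-hybrid assembly methods on long, error-prone nanopore reads.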

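Results:

We benchmarked five non-hybrid assembly pipelines (non-hybrid in terms of both error correction and scaffolding) and two hybrid assemblers that scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of Escherichia coli K-12, using several sequencing coverages of nanopore data (20x, 30x, 40x and 50x). To estimate the requirements for closed bacterial genome assembly, we assessed the assembly quality at each of these coverages. For this purpose, an extensible genome assembly benchmarking framework was developed. The results show that hybrid methods depend strongly on the quality of the NGS data but much less on the quality and coverage of the nanopore data, and that they perform relatively well at low nanopore coverage. When coverage exceeds 40x, all non-hybrid methods correctly assemble the E. coli genome, even the non-hybrid method tailored to Pacific Biosciences reads. Although that method requires higher coverage than a method designed specifically for nanopore reads, its running time is significantly lower.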

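Over the last ten years, next generation sequencing (NGS) devices have dominated the genome sequencing market. Compared with the previously used Sanger sequencing, NGS is much cheaper, less time-consuming and less labor-intensive. Yet when it comes to de novo assembly of longer genomes, many researchers are skeptical of using NGS reads. These devices produce reads a few hundred base pairs long, which is too short to unambiguously resolve repetitive regions even in relatively small microbial genomes (Nagarajan and Pop, 2013). Although paired-end and mate-pair technologies have improved the accuracy and completeness of assembled genomes, NGS sequencing still produces highly fragmented assemblies because of long repetitive regions. These incomplete genomes had to be finished with a more laborious approach that includes Sanger sequencing and specially tailored assembly methods. Owing to NGS, many efficient algorithms have been developed to optimize running time and memory footprint in sequence assembly, alignment and downstream analysis steps.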

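New sequencing approaches then emerged, the so-called "third generation sequencing technologies".
The first of these was the single-molecule sequencing technology developed by Pacific Biosciences (PacBio).
Although PacBio sequencers produce much longer reads (up to several tens of thousands of base pairs), their reads have significantly higher error rates (10% to 15%) than NGS reads (1%) (Schirmer et al., 2015).
Existing assembly and alignment algorithms could not handle such high error rates.
This led to the development of read error correction methods. At first, hybrid correction was performed using complementary NGS (Illumina) data (Koren et al., 2012).
Later, self-correction of PacBio-only reads was developed (Chin et al., 2013), which requires higher coverage (>50x).
New, more sensitive aligners had to be developed (BLASR (Chaisson and Tesler, 2012)) and existing ones optimized (BWA-MEM (Li, 2013)).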
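In 2014, Oxford Nanopore Technologies (ONT) presented their tiny MinION sequencer, about the size of a harmonica.
The MinION can produce reads up to a few hundred thousand base pairs long. 1D reads from the MinION sequencer (with the latest R7.3 chemistry) have a raw base accuracy below 75%, while higher-quality 2D reads (80–88% accuracy) make up only a fraction of all 2D reads (Ip et al., 2015; Laver et al., 2015).
This again spurred the need for even more sensitive mapping and realignment algorithms, such as GraphMap (Sovic et al., 2016) and marginAlign (Jain et al., 2015).
In 2015, Loman et al. demonstrated that assembling a bacterial genome (E. coli K-12) using only ONT reads is possible even with high error rates (Loman et al., 2015).
Thanks to the extremely long reads, affordability and availability of nanopore sequencing technology, these results may lead to a revolution in de novo sequence analysis in the near future.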

The majority of algorithms for de novo assembly follow either the de Bruijn graph (DBG) or the Overlap-Layout-Consensus (OLC) paradigm (Pop, 2009). OLC assemblers predate the DBG and were widely used in the Sanger sequencing era. A major representative of the OLC class is Celera, which was developed and maintained until very recently. The DBG approach attempted to solve the problem of ever-growing sequencing throughput brought on by NGS technologies. Unlike OLC, in which overlaps between reads have to be calculated explicitly, DBG splits the reads into k-mers and constructs the overlap graph implicitly, e.g. through a hash table lookup. While assembly in the OLC paradigm attempts to find a Hamiltonian path through an overlap graph, the DBG attempts to solve the virtually simpler problem of finding an Eulerian path through a de Bruijn graph. It was later shown that both de Bruijn and overlap graphs can be transformed into string graph form, in which, similar to the DBG, an Eulerian path also needs to be found to obtain the assembly (Myers, 2005). The major differences lie in the implementation specifics of the two approaches. Although the DBG approach is faster, OLC-based algorithms perform better for longer reads (Pop, 2009). Additionally, DBG assemblers depend on finding exact-matching k-mers between reads (typically 21 to 127 bases long (Bankevich and Pevzner, 2016)). Given the error rates in third generation sequencing data, this presents a serious limitation. The OLC approach, on the other hand, should be able to cope with higher error rates given a sensitive enough overlapper, but, contrary to the DBG, a time-consuming all-to-all pairwise comparison between input reads needs to be performed.
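To make the DBG idea concrete, here is a minimal Python sketch under stated assumptions. It is a toy illustration, not the implementation of any assembler discussed in this article. It assumes error-free reads, a single strand, and a graph that actually contains an Eulerian path. The function names (`build_de_bruijn`, `eulerian_path`, `spell`, `error_free_kmer_probability`) are hypothetical, and the last one additionally assumes that sequencing errors are independent and uniformly distributed, which real nanopore errors are not.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer.

    Overlaps are never computed explicitly; reads that share a k-mer are
    linked implicitly through the hash-based set and dictionary.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes a non-empty graph with an Eulerian path."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_degree[node] == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


def error_free_kmer_probability(k, error_rate):
    """Chance that a k-mer contains no error, assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # error-free toy reads
    print(spell(eulerian_path(build_de_bruijn(reads, 4))))  # ATGGCGTGCA
    for rate in (0.01, 0.15):
        print(rate, round(error_free_kmer_probability(21, rate), 3))
```

Under the independence assumption, a 21-mer from a read with a 1% error rate is error-free about 81% of the time, while at a 15% error rate only about 3% of 21-mers survive intact. This is one way to see why exact k-mer matching becomes a serious limitation for third generation reads.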
