SARST2 high-throughput and resource-efficient protein structure alignment against massive databases

Professor Wei-Cheng Lo (羅惟正) and his research team publish their findings in Nature Communications

Link:

https://www.nature.com/articles/s41467-025-63757-9

Abstract

The flood of protein structural Big Data is coming. With the belief that biotech researchers deserve powerful analysis engines to overcome the challenge of rapidly increasing computational demands, we are devoted to developing efficient protein structural alignment search algorithms to assist researchers as they push the frontiers of biological sciences and technology. Here, we present SARST2, an algorithm that integrates primary, secondary, and tertiary structural features with evolutionary statistics to perform accurate and rapid alignments. In large-scale benchmarks, SARST2 outperforms state-of-the-art methods in accuracy, while completing AlphaFold Database searches significantly faster and with substantially less memory than BLAST and Foldseek. It employs a filter-and-refine strategy enhanced by machine learning, a diagonal shortcut for word-matching, a weighted contact number-based scoring scheme, and a variable gap penalty based on substitution entropy. SARST2, implemented in Golang as standalone programs available at https://10lab.ceb.nycu.edu.tw/sarst2 and https://github.com/NYCU-10lab/sarst, enables massive database searches using even ordinary personal computers.
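The filter-and-refine idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the SARST2 implementation. It assumes each structure has already been reduced to a one-dimensional string of structural letters, uses a generic FASTA-style count of word hits on a shared diagonal as the cheap filter, and uses plain Smith-Waterman with a fixed linear gap penalty as the refinement step. All function names, parameters, scores, and the toy data are hypothetical; the machine-learning filter, the weighted contact number scoring, and the entropy-based gap penalty described above are not modeled.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Map each length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal_hits(query_index, target, k):
    """Filter score: the largest number of word hits sharing one diagonal (i - j)."""
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in query_index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return max(diagonals.values(), default=0)


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Refine score: Smith-Waterman local alignment with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def filter_and_refine(query, database, k=3, keep=2):
    """Rank entries: cheap diagonal filter first, full alignment on the survivors."""
    q_index = word_index(query, k)
    filtered = sorted(
        database.items(),
        key=lambda item: best_diagonal_hits(q_index, item[1], k),
        reverse=True,
    )[:keep]
    refined = [(name, local_alignment_score(query, seq)) for name, seq in filtered]
    return sorted(refined, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy strings standing in for structures encoded as letters (hypothetical data).
    database = {
        "close": "HHHHEEEELLHHHH",
        "partial": "LLLLHHHHEEEE",
        "unrelated": "CCCCCCCCCCCC",
    }
    for name, score in filter_and_refine("HHHHEEEELLHH", database):
        print(name, score)
```

The point of the design is that the filter touches every database entry but costs only a pass over its words, while the quadratic-time alignment runs only on the few entries that survive the filter.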

Summary

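A protein's function is determined by its structure. Solving structures helps us understand how function arises and supports the development of protein drugs and biomimetic molecular materials. Structure determination is difficult, however: before 2020, of the more than one hundred million known protein sequences, only a little over a hundred thousand had known structures. In 2020, Google DeepMind released AlphaFold2, an accurate structure prediction algorithm, and announced that it would predict structures for the more than 200 million sequences known at the time. Our laboratory realized that the volume of protein structure data would grow a thousandfold and that alignment analysis would become extremely time-consuming, so we set out to develop a high-performance, parallel-computing method for structural alignment search to help protein research teams worldwide cope with the enormous computational load. Our SARST2 algorithm is hundreds to tens of thousands of times more efficient than earlier methods, and it can complete an alignment search of the AlphaFold database, with its more than 200 million entries, in three minutes on a personal computer while using very few resources. For the same task, Foldseek, a well-known recent parallel-computing method, needs six times the time, forty times the memory, and three times the disk space. We thank the National Science and Technology Council, the Ministry of Education, and our home College of Engineering Bioscience for their support on all fronts. Our team will keep working to promote parallel-computing technology in protein structure analysis, in the hope of helping Taiwan take an international lead in high-throughput computing.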