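These notes record the useful information I collected while studying the article "Machine learning for predicting protein properties: A comprehensive review" (ScienceDirect).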

Common protein sequence, structure, and function databases

| Application | Database | Disk Space | Description | Link |
|---|---|---|---|---|
| Protein Structure | AlphaFoldDB | 23TB | Store 3D structural data of proteins. | https://alphafold.com/ |
| Protein Structure | BioLiP | 100GB | A semi-manually curated database for high-quality data. | https://zhanggroup.org/BioLiP/ |
| Protein Structure | ChEMBL | 2TB | A database of bioactive molecules with drug-like properties. | https://www.ebi.ac.uk/chembl/ |
| Protein Structure | PDB | 50TB | 3D structures (protein, nucleic acids). | https://www.rcsb.org/ |
| Protein Structure | PubChem | 300TB | Including chemical and physical characteristics. | https://pubchem.ncbi.nlm.nih.gov/ |
| Protein Structure | STRING | 20TB | 59.3M protein data. | https://string-db.org/ |
| Protein Structure | UniProt | 20TB | Composed of UniParc, UniRef, and UniProtKB. | https://www.uniprot.org/ |
| Protein Structure | ZINC | 17GB | Contains over 35M purchasable compounds. | https://zinc12.docking.org/ |
| Protein Structure | BFD | 270GB | Created by clustering 2.5B protein sequences. | https://bfd.mmsegs.com/ |
| Protein Function | BRENDA | 63MB | Contains more than 84K proteins and 104K enzyme entries. | https://www.brenda-enzymes.org/ |
| Protein Function | ExPASy ENZYME | 23MB | Describes characteristics of enzymes and metabolic pathways. | https://enzyme.expasy.org/ |
| Protein Function | GO knowledge-base | 500MB | Includes 4.2K GO terms, 7.6M annotations, 1.5M gene products, and 5.4K species. | https://www.geneontology.org/ |
| Protein Function | KEGG ENZYME | - | Provides enzyme classification, function, and metabolic channel information. | https://www.genome.jp/kegg/ |
| Protein Function | PDBbind | 3.5GB | Provides binding affinity data for 23K biomolecular complexes. | http://pdbbind.org.cn/ |
| Protein Function | SABIO-RK | - | Kinetic data of biochemical reactions. | http://sabio.h-its.org/ |
| Protein Function | ScOPe | 175MB | Protein structure classification database. | https://scop.berkeley.edu/ |
| Protein-Ligand Interaction | DUD-E | 2.6GB | 22K active compounds and their affinities against 102 targets. | https://dude.docking.org/ |
| Protein-Ligand Interaction | CrossDocked2020 | 90GB | More than 22M ligand-protein affinity scores. | https://bits.csb.pitt.edu/files/crossdock2020/ |
| Protein-Ligand Interaction | BindingDB | 4.45GB | Contains about 2.8M binding data for 9.3K proteins. | https://www.bindingdb.org/ |
| Protein-Ligand Interaction | STITCH | 379.35GB | A resource to explore known and predicted interactions between chemicals. | http://stitch.embl.de/ |
| Protein Physicochemical Properties | ProTherm | - | Contains experimentally determined thermodynamic parameters of proteins. | https://web.iitm.ac.in/bioinfo2/prothermdb |
| Protein Physicochemical Properties | Pfam | 416GB | A collection of protein families, containing 21K entries. | http://pfam-legacy.xfam.org/ |
| Protein Physicochemical Properties | ThermoMutDB | 122MB | Containing over 14K experimental data of thermodynamic properties. | https://biosig.lab.uq.edu.au/thermomutdb/ |
| Protein Physicochemical Properties | Aggrescan3D | 287MB | Provides analysis of solubility and aggregation propensities. | https://biocomp.chem.uw.edu.pl/A3D2/hproteome |

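Notes:

  • The disk space figures in the table are estimates for the compressed format.
  • A "-" in a field means the corresponding data may not be directly available.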

Existing pretrained models for proteins and molecules:

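| Application | Classification | Method | Input | Architecture | Database | Year |
|---|---|---|---|---|---|---|
| Protein Structure Prediction | Secondary structure prediction | CFLM [96] | Sequence | RDN | PDB, CASP | 2023 |
| Protein Structure Prediction | Secondary structure prediction | CondGCNN [95] | Sequence | CNN, RNN | CullPDB, CB513, CASP | 2022 |
| Protein Structure Prediction | Tertiary structure prediction | Cerebra [112] | Sequence, MSA | Transformer | PDB, CAMEO | 2024 |
| Protein Structure Prediction | Tertiary structure prediction | Evo [108] | Sequence | StripedHyena | OpenGenome | 2024 |
| Protein Structure Prediction | Tertiary structure prediction | ESMFold [67] | Sequence | Transformer | UniRef | 2022 |
| Protein Structure Prediction | Tertiary structure prediction | trRosettaX-Single [102] | Sequence | Transformer, CNN | PDB, Orphan54 | 2022 |
| Protein Structure Prediction | Tertiary structure prediction | RoseTTAFold [101] | Sequence, Structure, MSA | Transformer | UniProt, BFD | 2021 |
| Protein Structure Prediction | Tertiary structure prediction | AlphaFold2 [100] | Sequence, Structure, MSA | Transformer | UniProt, PDB, BFD | 2021 |
| Protein Function Prediction | GO prediction | Struct2GO [115] | Sequence, Structure | GNN, RNN | EMBL-EBI, GO | 2023 |
| Protein Function Prediction | GO prediction | DeepFRI [118] | Sequence, Structure | GNN | GO, PDB, CSA | 2021 |
| Protein Function Prediction | EC prediction | CLEAN [120] | Sequence | CNN | UniProt | 2023 |
| Protein Function Prediction | EC prediction | HDMLF [121] | Sequence | RNN | UniProt | 2023 |
| Protein Function Prediction | EC&GO prediction | GearNet [7] | Structure | GNN | EMBL-EBI, UniProt, GO | 2022 |
| Protein-Ligand Interactions | Affinity prediction | PocketAnchor [122] | Sequence, Structure | GNN | CASF, PDBbind | 2023 |
| Protein-Ligand Interactions | Affinity prediction | GraphscoreDTA [123] | Sequence, Structure | GNN | SIFTS, PDBbind | 2023 |
| Protein-Ligand Interactions | Affinity prediction | ProtNet [124] | Sequence, Structure | GNN | PDBbind | 2023 |
| Protein-Ligand Interactions | Affinity prediction | Transformer-M [63] | Sequence, Structure | Transformer | PDBbind | 2022 |
| Protein-Ligand Interactions | Affinity prediction | 3D-CNNs, SG-CNNs [125] | Sequence, Structure | CNN | PDBbind | 2021 |
| Protein-Ligand Interactions | Affinity prediction | DeepAffinity [126] | Sequence | RNN, CNN, GNN | PDBbind, UniRef | 2019 |
| Protein Physicochemical Properties Prediction | Stability | RaSP [131] | Sequence, Structure | CNN | ThermoMutDB | 2024 |
| Protein Physicochemical Properties Prediction | Solubility | PON-Sol2 [134] | Sequence | Random Forest | PON-Sol | 2021 |
| Protein Physicochemical Properties Prediction | Solubility | SoluProt [135] | Sequence | CNN | Target-Track | 2021 |
| Protein Physicochemical Properties Prediction | Subcellular localization | DeepLoc2.0 [136] | Sequence | LSTM | UniProt | 2022 |
| Protein Physicochemical Properties Prediction | Binding energy | super-HTS [137] | Sequence | GNN | Created by Rosetta | 2016 |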

Non-redundant protein sequence databases

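The UniRef datasets are collections of sequence clusters drawn from UniProt and selected UniParc records, designed to cover sequence space at several resolutions while hiding redundant sequences. Sequences are clustered at 100%, 90%, and 50% sequence identity (UniRef100, UniRef90, and UniRef50), and grouping similar sequences this way speeds up sequence alignment. The BFD database contains more than 2.5B sequences spanning bacteria, archaea, eukaryotes, and other domains of life, which ensures data diversity and helps capture the broad variation in protein function and structure. UniRef and BFD are widely used by protein structure prediction methods such as AlphaFold2.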
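
To make the idea of clustering at several identity thresholds concrete, here is a minimal Python sketch of greedy representative-based clustering. It is a toy illustration only, not the pipeline used to build UniRef or BFD: the `identity` function is a crude stand-in based on `difflib.SequenceMatcher` rather than a real sequence alignment, and the four peptide strings are invented for the example.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity proxy in [0, 1]; NOT a real alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Put each sequence in the first cluster whose representative it matches."""
    clusters: list[list[str]] = []  # clusters[k][0] is the representative
    # Visit longer sequences first so each representative is a longest member.
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


# Invented toy sequences (assumption, not taken from any database).
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVA"]
for t in (1.0, 0.9, 0.5):
    reps = [c[0] for c in greedy_cluster(seqs, t)]
    print(f"threshold={t:.1f}: {len(reps)} representative(s)")
```

Lowering the threshold can only merge clusters, so fewer representatives remain. A downstream search can then compare a query against the representatives first instead of every member, which is the sense in which clustering speeds up alignment.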

Non-redundant protein structure databases

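Some models use structural information from similar proteins to assist structure prediction. The Protein Data Bank (PDB), the largest structure database, is widely used as a repository of 3D structures of proteins, nucleic acids, and their complexes, and currently holds nearly 218K experimentally determined structure entries. Most of these were obtained with experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). PDB70 is a dataset derived from the PDB that uses a clustering algorithm to organize similar protein structures into clusters. Each cluster represents a group of protein structures with high sequence similarity, which makes template search more efficient.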

On RNNs and LSTMs

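RNNs are well known to suffer from exploding and vanishing gradients, and they struggle to capture long-term dependencies in long sequences. The long short-term memory (LSTM) architecture was introduced to address these problems. An LSTM can be seen as an overall refinement of the RNN that adds gating units and a memory mechanism. The gating units are the forget gate, the input gate, and the output gate; they control the flow of information and let the LSTM selectively remember or forget it. Unlike a traditional RNN, an LSTM computes the values of these three gates at each step of the sequence to decide whether to forget, update, or output information. It then updates the contents of its memory cell according to those values, which is how it manages to capture long-range relationships in long sequences. Several studies have improved on the original LSTM; the variants differ mainly in how they choose and modify the gating mechanism, the memory cell, and the parameter-sharing scheme. Like CNNs, LSTMs are well suited to protein sequence data and have been used for tasks such as protein function prediction and protein-ligand interaction prediction.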
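
The gate computations described above can be sketched as a single LSTM time step in plain Python. This is a minimal illustration of the standard LSTM cell equations, not code from the review: the toy alphabet, the hidden size, and the constant weights produced by the hypothetical `make` helper are all assumptions chosen so the example runs without any dependencies.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, U, b, x, h):
    """Compute W @ x + U @ h + b using plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(W[k], x))
        + sum(u * hi for u, hi in zip(U[k], h))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, U, b)."""
    f = [sigmoid(v) for v in affine(*params["forget"], x, h_prev)]  # what to discard
    i = [sigmoid(v) for v in affine(*params["input"], x, h_prev)]   # what to write
    o = [sigmoid(v) for v in affine(*params["output"], x, h_prev)]  # what to expose
    g = [math.tanh(v) for v in affine(*params["candidate"], x, h_prev)]  # new content
    # Memory cell update: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


ALPHABET = "ACDE"  # toy residue alphabet (assumption)
HIDDEN = 2         # toy hidden size (assumption)


def one_hot(ch: str) -> list[float]:
    return [1.0 if ch == a else 0.0 for a in ALPHABET]


def make(scale: float, bias: float = 0.0):
    """Hypothetical helper: constant weights, for illustration only."""
    W = [[scale] * len(ALPHABET) for _ in range(HIDDEN)]
    U = [[scale] * HIDDEN for _ in range(HIDDEN)]
    return W, U, [bias] * HIDDEN


params = {"forget": make(0.5), "input": make(0.1),
          "output": make(0.2), "candidate": make(0.3)}
h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
for residue in "ACDA":
    h, c = lstm_step(one_hot(residue), h, c, params)
print(h, c)
```

When the forget gate saturates near 1 and the input gate near 0, the memory cell is carried forward almost unchanged, which is the mechanism that lets information survive across many time steps.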

Existing research paradigms for proteins

[Figure: image-20241222213448154]

Using AlphaFold2 structure predictions and clustering to investigate whether enzymes have new functions (a Cell paper)

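Recently, Huang et al. introduced a structure-based protein clustering method for discovering deaminase functions and identifying new deaminase families ("Discovery of deaminase functions by structure-based protein clustering", Cell). They applied AlphaFold2 to predict protein structures, then used structural alignment to cluster the entire deaminase protein family by predicted structural similarity. They discovered new functions of deaminase proteins as well as novel deaminases; such findings could not have been obtained by mining amino acid sequences.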
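
As a rough illustration of clustering by structural similarity, the Python sketch below groups proteins whose pairwise similarity score reaches a threshold, using single-linkage clustering with a union-find structure. It does not reproduce the alignment tool or clustering algorithm of Huang et al.: the protein names, the similarity matrix, and the 0.5 threshold are invented for the example, and in practice the scores would come from a structural alignment of the predicted structures.

```python
def cluster_by_similarity(names, sim, threshold):
    """Single-linkage clustering: link every pair with sim >= threshold."""
    parent = list(range(len(names)))

    def find(k):
        # Follow parent links to the root, shortening the path as we go.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            if sim[a][b] >= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for k, name in enumerate(names):
        groups.setdefault(find(k), []).append(name)
    return sorted(groups.values())


# Hypothetical proteins and made-up pairwise structural similarity scores.
names = ["protA", "protB", "protC", "protD"]
sim = [
    [1.00, 0.82, 0.35, 0.30],
    [0.82, 1.00, 0.40, 0.28],
    [0.35, 0.40, 1.00, 0.75],
    [0.30, 0.28, 0.75, 1.00],
]
clusters = cluster_by_similarity(names, sim, threshold=0.5)
print(clusters)
```

Proteins that land in the same structural cluster can then be examined together for shared or novel function, even when their sequences are too divergent for sequence-based clustering to group them.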