本片笔记记录我学习用于预测蛋白质特性的机器学习:全面综述 - ScienceDirect 该文章所收集到的有用信息
常用的蛋白质序列、结构和功能数据库
Application | Database | Disk Space | Description | Link |
---|---|---|---|---|
Protein Structure | AlphaFoldDB | 23TB | Store 3D structural data of proteins. | https://alphafold.com/ |
BioLiP | 100GB | A semi-manually curated database for high-quality data. | https://zhanggroup.org/BioLiP/ | |
ChEMBL | 2TB | A database of bioactive molecules with drug-like properties. | https://www.ebi.ac.uk/chembl/ | |
PDB | 50TB | 3D structures (protein, nucleic acids). | https://www.rcsb.org/ | |
PubChem | 300TB | Including chemical and physical characteristics. | https://pubchem.ncbi.nlm.nih.gov/ | |
STRING | 20TB | 59.3M protein data. | https://string-db.org/ | |
UniProt | 20TB | Composed of UniParc, UniRef, and UniProtKB. | https://www.uniprot.org/ | |
ZINC | 17GB | Contains over 35M purchasable compounds. | https://zinc12.docking.org/ | |
BFD | 270GB | Created by clustering 2.5B protein sequences. | https://bfd.mmsegs.com/ | |
Protein Function | BRENDA | 63MB | Contains more than 84K proteins and 104K enzyme entries. | https://www.brenda-enzymes.org/ |
ExPASy ENZYME | 23MB | Describes characteristics of enzymes and metabolic pathways. | https://enzyme.expasy.org/ | |
GO knowledge-base | 500MB | Includes 4.2K GO terms, 7.6M annotations, 1.5M gene products, and 5.4K species. | https://www.geneontology.org/ | |
KEGG ENZYME | - | Provides enzyme classification, function, and metabolic channel information. | https://www.genome.jp/kegg/ | |
PDBbind | 3.5GB | Provides binding affinity data for 23K biomolecular complexes. | http://pdbbind.org.cn/ | |
SABIO-RK | - | Kinetic data of biochemical reactions. | http://sabio.h-its.org/ | |
ScOPe | 175MB | Protein structure classification database. | https://scop.berkeley.edu/ | |
Protein-Ligand Interaction | DUD-E | 2.6GB | 22K active compounds and their affinities against 102 targets. | https://dude.docking.org/ |
CrossDocked2020 | 90GB | More than 22M ligand-protein affinity scores. | https://bits.csb.pitt.edu/files/crossdock2020/ | |
BindingDB | 4.45GB | Contains about 2.8M binding data for 9.3K proteins. | https://www.bindingdb.org/ | |
STITCH | 379.35GB | A resource to explore known and predicted interactions between chemicals. | http://stitch.embl.de/ | |
Protein Physicochemical Properties | ProTherm | - | Contains experimentally determined thermodynamic parameters of proteins. | https://web.iitm.ac.in/bioinfo2/prothermdb |
Pfam | 416GB | A collection of protein families, containing 21K entries. | http://pfam-legacy.xfam.org/ | |
ThermoMutDB | 122MB | Containing over 14K experimental data of thermodynamic properties. | https://biosig.lab.uq.edu.au/thermomutdb/ | |
Aggrescan3D | 287MB | Provides analysis of solubility and aggregation propensities. | https://biocomp.chem.uw.edu.pl/A3D2/hproteome |
备注:
- 表中的磁盘空间是压缩格式下的估计数据。
- 如果某些字段显示为“-”,表示相关数据可能无法直接获取。
现有的蛋白质和分子预训练模型:
Application | Classification | Method | Input | Architecture | Database | Year |
---|---|---|---|---|---|---|
Protein Structure Prediction | Secondary structure prediction | CFLM [96] | Sequence | RDN | PDB, CASP | 2023 |
CondGCNN [95] | Sequence | CNN, RNN | CullPDB, CB513, CASP | 2022 | ||
Tertiary structure prediction | Cerebra [112] | Sequence, MSA | Transformer | PDB, CAMEO | 2024 | |
Evo [108] | Sequence | StripedHyena | OpenGenome | 2024 | ||
ESMFold [67] | Sequence | Transformer | UniRef | 2022 | ||
trRosettaX-Single [102] | Sequence | Transformer, CNN | PDB, Orphan54 | 2022 | ||
RoseTTAFold [101] | Sequence, Structure, MSA | Transformer | UniProt, BFD | 2021 | ||
AlphaFold2 [100] | Sequence, Structure, MSA | Transformer | UniProt, PDB, BFD | 2021 | ||
Protein Function Prediction | GO prediction | Struct2GO [115] | Sequence, Structure | GNN, RNN | EMBL-EBI, GO | 2023 |
DeepFRI [118] | Sequence, Structure | GNN | GO, PDB, CSA | 2021 | ||
EC prediction | CLEAN [120] | Sequence | CNN | UniProt | 2023 | |
HDMLF [121] | Sequence | RNN | UniProt | 2023 | ||
EC&GO prediction | GearNet [7] | Structure | GNN | EMBL-EBI, UniProt, GO | 2022 | |
Protein-Ligand Interactions | Affinity prediction | PocketAnchor [122] | Sequence, Structure | GNN | CASF, PDBbind | 2023 |
GraphscoreDTA [123] | Sequence, Structure | GNN | SIFTS, PDBbind | 2023 | ||
ProtNet [124] | Sequence, Structure | GNN | PDBbind | 2023 | ||
Transformer-M [63] | Sequence, Structure | Transformer | PDBbind | 2022 | ||
3D-CNNs, SG-CNNs [125] | Sequence, Structure | CNN | PDBbind | 2021 | ||
DeepAffinity [126] | Sequence | RNN, CNN, GNN | PDBbind, UniRef | 2019 | ||
Protein Physicochemical Properties Prediction | Stability | RaSP [131] | Sequence, Structure | CNN | ThermoMutDB | 2024 |
Solubility | PON-Sol2 [134] | Sequence | Random Forest | PON-Sol | 2021 | |
SoluProt [135] | Sequence | CNN | Target-Track | 2021 | ||
Subcellular localization | DeepLoc2.0 [136] | Sequence | LSTM | UniProt | 2022 | |
Binding energy | super-HTS [137] | Sequence | GNN | Created by Rosetta | 2016 |
非冗余蛋白质序列数据库
UniRef 数据集是从 UniProt 和选定的 UniParc 记录中提取的序列簇的集合,旨在通过隐藏冗余序列来覆盖多个分辨率级别的序列空间。该数据集根据 100%、90% 和 50% 序列相似性标准对序列进行聚类,从而通过相似序列的聚类来更快地进行序列比对。BFD 数据库包含超过 2.5B 序列,涵盖细菌、古细菌、真核生物和其他生物界,确保数据多样性并有助于捕获蛋白质功能和结构的广泛变化。UniRef 和 BFD 广泛用于 AlphaFold2 等蛋白质结构预测方法。
非冗余蛋白质结构数据库
一些模型可能会利用相似蛋白质的结构信息来帮助进行结构预测。最大的结构数据库蛋白质数据库 (PDB) 广泛用于收集 3D 蛋白质结构、核酸及其复合物,目前包含近 218K 的实际结构条目。这些数据大多是通过 X 射线晶体学、核磁共振 (NMR) 波谱和冷冻电子显微镜 (cryo-EM) 等实验方法获得的。PDB70 是从 PDB 数据库生成的特定数据集,它使用聚类算法将相似的蛋白质结构组织成簇。每个簇代表一组序列相似度高的蛋白质结构,从而提高模板搜索时的搜索效率。
关于RNN和LSTM
众所周知,RNN 容易出现梯度爆炸和消失的问题,并且在处理长序列数据时难以捕获长期依赖性。为了解决这些问题,引入了长短期记忆网络 (LSTMs) 架构。LSTM 可以看作是 RNN 的整体优化,引入了门控单元和存储机制。门控单元包括忘记门、输入门和输出门,控制信息流并允许 LSTM 选择性地记住或忘记信息。与传统的 RNN 不同,LSTM 在处理序列数据时计算三个门控单元的值,从而决定是忘记、更新还是输出信息。随后,LSTM 根据这些值更新存储单元的内容,从而能够选择性地记住或忘记信息,从而解决了在处理长序列数据时捕获长期关系的问题。一些研究方法对原始 LSTM 进行了改进,主要区别在于门控机制、存储单元和参数共享方法的选择和修改。与 CNN 一样,LSTM 适用于处理蛋白质序列数据,并已用于蛋白质功能预测和蛋白质-配体相互作用预测等任务。
现有蛋白质的研究范式
使用AlphaFold2的结构预测,聚类研究是否存在酶的新功能(一篇Cell工作)
最近,Huang 等人介绍了一种基于结构的蛋白质聚类方法,用于发现脱氨酶功能并鉴定新的脱氨酶家族(Discovery of deaminase functions by structure-based protein clustering: Cell).他们应用 AlphaFold2 来预测蛋白质结构,随后通过结构比对根据预测的结构相似性对整个脱氨酶蛋白家族进行聚类。他们发现了脱氨酶蛋白和新脱氨酶的新功能;这样的发现不能通过挖掘氨基酸序列来获得。
Comments NOTHING