Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE
2023 Jan 24;21(1):12.
doi: 10.1186/s12915-023-01510-8.
Chao Wang 1, Quan Zou 2
Affiliations
1 School of Software Engineering, Chengdu University of Information Technology, Chengdu, China.
2 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. zouquan@nclab.net.
PMID: 36694239
PMCID: PMC9875434
DOI: 10.1186/s12915-023-01510-8
Chao Wang et al. BMC Biol. 2023. Free PMC article.
Abstract
Background: Protein solubility is a precondition for efficient heterologous protein expression, which underlies most industrial applications, and for functional interpretation in basic research. However, the recurrent formation of inclusion bodies remains a persistent roadblock in protein science and industry: only about a quarter of proteins can be successfully expressed in soluble form. Although numerous solubility prediction models have been developed, their performance remains unsatisfactory given the rapid increase in available protein sequences. Hence, novel and highly accurate predictors are needed to prioritize highly soluble proteins and reduce the cost of experimental work.
Results: In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and a distributed representation of amino acids. Comparisons showed that the proposed model achieved more accurate and more balanced performance than existing tools. Furthermore, we explored the specific features that have a dominant impact on model performance, as well as their interaction effects.
Conclusions: DeepSoluE is suitable for predicting protein solubility in E. coli; it serves as a bioinformatics tool for prescreening potentially soluble targets to reduce the cost of wet-lab experiments. A freely accessible webserver is available (http://lab.malab.cn/~wangchao/softs/DeepSoluE/).
Keywords: Feature embedding; Interpretation; Machine learning; Protein solubility.
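To make the idea of "hybrid features" concrete, the following is a minimal sketch in plain Python, not the DeepSoluE implementation. It assumes a small set of illustrative physicochemical descriptors (amino acid composition, mean Kyte-Doolittle hydropathy, charged fraction), a random per-residue embedding standing in for a learned distributed representation, and an untrained single-layer LSTM. All function names, dimensions, and the way the two feature types are concatenated are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (one common physicochemical index;
# its use here is an illustrative assumption).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def physchem_features(seq):
    """Return 22 global features: composition (20), mean hydropathy, charged fraction."""
    residues = [aa for aa in seq.upper() if aa in HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    composition = [residues.count(aa) / n for aa in AMINO_ACIDS]
    mean_hydropathy = sum(HYDROPATHY[aa] for aa in residues) / n
    charged_fraction = sum(aa in "DEKR" for aa in residues) / n
    return composition + [mean_hydropathy, charged_fraction]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_last_hidden(inputs, hidden_size, rng):
    """Run a single-layer LSTM (biases omitted, random weights); return the final hidden state."""
    width = len(inputs[0]) + hidden_size
    w_in, w_forget, w_cand, w_out = (
        random_matrix(hidden_size, width, rng) for _ in range(4)
    )
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in inputs:
        z = x + h  # concatenate current input with previous hidden state
        i_gate = [sigmoid(a) for a in matvec(w_in, z)]
        f_gate = [sigmoid(a) for a in matvec(w_forget, z)]
        cand = [math.tanh(a) for a in matvec(w_cand, z)]
        o_gate = [sigmoid(a) for a in matvec(w_out, z)]
        c = [f * cj + i * g for f, cj, i, g in zip(f_gate, c, i_gate, cand)]
        h = [o * math.tanh(cj) for o, cj in zip(o_gate, c)]
    return h


def solubility_score(seq, embed_dim=8, hidden_size=6, seed=0):
    """Combine sequence-level LSTM output with global features into a score in (0, 1).

    The weights are random, so the score is meaningless until the model is trained.
    """
    global_features = physchem_features(seq)  # also validates the sequence
    rng = random.Random(seed)
    # Hypothetical stand-in for a learned distributed representation.
    embedding = {aa: [rng.uniform(-1, 1) for _ in range(embed_dim)] for aa in AMINO_ACIDS}
    tokens = [embedding[aa] for aa in seq.upper() if aa in embedding]
    hidden = lstm_last_hidden(tokens, hidden_size, rng)
    hybrid = hidden + global_features  # hybrid feature vector
    weights = [rng.uniform(-0.1, 0.1) for _ in hybrid]
    return sigmoid(sum(w * v for w, v in zip(weights, hybrid)))


if __name__ == "__main__":
    print(solubility_score("MKTAYIAKQRQISFVKSHFSRQ"))
```

The sketch only shows how per-residue and whole-sequence information can be joined before a final classifier. A usable predictor would need trained weights, a learned embedding, and the feature selection step described in the article.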
© 2023. The Author(s).
Conflict of interest statement
The authors declare that they have no competing interests.
Figures
Fig. 1 Comparison of different feature selection methods. A–E Metric values and feature…
Fig. 2 Performance comparison of DeepSoluE and 11 conventional machine learning methods
Fig. 3 Feature contribution and dependency analysis. A The 20 most important features. B Summary…
Fig. 4 The DeepSoluE workflow. A Physicochemical feature encoding, feature optimization, and distributed representation of…
Similar articles
DeepAc4C: a convolutional neural network model with hybrid features composed of physicochemical patterns and distributed representation information for identification of N4-acetylcytidine in mRNA. Wang C, Ju Y, Zou Q, Lin C. Bioinformatics. 2021 Dec 22;38(1):52-57. doi: 10.1093/bioinformatics/btab611. PMID: 34427581
Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Chang CC, Song J, Tey BT, Ramanan RN. Brief Bioinform. 2014 Nov;15(6):953-62. doi: 10.1093/bib/bbt057. Epub 2013 Aug 7. PMID: 23926206 Review.
Enhancer-FRL: Improved and Robust Identification of Enhancers and Their Activities Using Feature Representation Learning. Wang C, Zou Q, Ju Y, Shi H. IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):967-975. doi: 10.1109/TCBB.2022.3204365. Epub 2023 Apr 3. PMID: 36063523
DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks. Madani M, Lin K, Tarakanova A. Int J Mol Sci. 2021 Dec 17;22(24):13555. doi: 10.3390/ijms222413555. PMID: 34948354 Free PMC article.
Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Chen Z, Liu X, Li F, Li C, Marquez-Lago T, Leier A, Akutsu T, Webb GI, Xu D, Smith AI, Li L, Chou KC, Song J. Brief Bioinform. 2019 Nov 27;20(6):2267-2290. doi: 10.1093/bib/bby089. PMID: 30285084 Free PMC article. Review.
Cited by
A Transformer-Based Ensemble Framework for the Prediction of Protein-Protein Interaction Sites. Mou M, Pan Z, Zhou Z, Zheng L, Zhang H, Shi S, Li F, Sun X, Zhu F. Research (Wash D C). 2023 Sep 27;6:0240. doi: 10.34133/research.0240. eCollection 2023. PMID: 37771850 Free PMC article.
MV-CVIB: a microbiome-based multi-view convolutional variational information bottleneck for predicting metastatic colorectal cancer. Cui Z, Wu Y, Zhang QH, Wang SG, He Y, Huang DS. Front Microbiol. 2023 Aug 22;14:1238199. doi: 10.3389/fmicb.2023.1238199. eCollection 2023. PMID: 37675425 Free PMC article.
Design, evaluation, and immune simulation of potentially universal multi-epitope mpox vaccine candidate: focus on DNA vaccine. Rcheulishvili N, Mao J, Papukashvili D, Feng S, Liu C, Wang X, He Y, Wang PG. Front Microbiol. 2023 Jul 21;14:1203355. doi: 10.3389/fmicb.2023.1203355. eCollection 2023. PMID: 37547674 Free PMC article.
MIX-TPI: a flexible prediction framework for TCR-pMHC interactions based on multimodal representations. Yang M, Huang ZA, Zhou W, Ji J, Zhang J, He S, Zhu Z. Bioinformatics. 2023 Aug 1;39(8):btad475. doi: 10.1093/bioinformatics/btad475. PMID: 37527015 Free PMC article.
Golgi_DF: Golgi proteins classification with deep forest. Bao W, Gu Y, Chen B, Yu H. Front Neurosci. 2023 May 12;17:1197824. doi: 10.3389/fnins.2023.1197824. eCollection 2023. PMID: 37250391 Free PMC article.
Publication types
Research Support, Non-U.S. Gov't
MeSH terms
Amino Acid Sequence
Computational Biology / methods
Escherichia coli* / genetics
Escherichia coli* / metabolism
Protein Processing, Post-Translational
Proteins* / metabolism
Solubility
Substances
Proteins
Grants and funding
62002051/National Natural Science Foundation of China
62131004/National Natural Science Foundation of China
62272065/National Natural Science Foundation of China
62250028/National Natural Science Foundation of China