There is increasing evidence that non-coding RNAs (ncRNAs) play significant roles in multiple fundamental biological processes. In particular, ncRNA interactions provide valuable insights into protein synthesis, gene expression regulation, RNA processing, and localization control. Dysregulation of ncRNA interactions may lead to severe human diseases such as cancer, cardiovascular disorders, immune dysfunction, and inflammatory diseases. Multiple experimental techniques have been developed to identify ncRNA-protein interactions (ncRPIs), but they remain expensive and time-consuming. To overcome these limitations, multiple computational methods for identifying RNA-protein interactions (RPIs) have been proposed. Because of their powerful feature learning capabilities, deep learning (DL) models have become a standard approach for a variety of biological sequence analysis problems, and we develop a DL model for the same purpose. As input to the DL model, we encoded each RNA sequence by its 4-mer frequencies to obtain a 256-dimensional feature vector. Each protein sequence was first mapped from the 20 amino acids to a reduced 7-letter alphabet, and the 3-mer Conjoint Triad Feature (CTF) was then used to extract a 343-dimensional feature vector. We propose a new computational model, RPI_SDA-XGBoost, which uses a stacked denoising auto-encoder architecture to extract high-level features from two separate networks, one for protein and one for ncRNA. The extracted ncRNA and protein features were then fed into a random forest (RF) classifier. Moreover, we used a stacked ensembling strategy to combine the different outputs to train an XGBoost classifier, improving the prediction performance of the proposed model. We validated the proposed model on five benchmark datasets: RPI 369, RPI 488, RPI 1807, RPI 2241, and NPInter v2.0.
RPI_SDA-XGBoost performed well in predicting ncRNA-protein interactions on all datasets and outperformed the majority of previous methods. The proposed method obtained an area under the curve (AUC) of 0.7299, 0.8951, 0.9923, 0.8902, and 0.9563 on the five ncRPI datasets, respectively. On the two largest datasets, RPI 2241 and NPInter v2.0, it also obtained the highest precision, 87.9% and 94.6%, respectively. We believe the proposed model provides a useful direction for future biological research and helps the community apply more sophisticated computational approaches to the ncRPI problem.
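The two sequence encodings described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names are hypothetical, the 7-class amino acid grouping is assumed to be the commonly used conjoint triad grouping (the abstract does not list the classes), and both vectors are normalized as simple relative frequencies, which may differ from the normalization used in the thesis.

```python
from itertools import product

RNA_ALPHABET = "ACGU"

# Assumption: the widely used 7-class conjoint triad grouping.
# The abstract only says "7-letter alphabet" and does not list the classes.
AA_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(AA_GROUPS) for aa in group}


def kmer_frequencies(seq, alphabet, k):
    """Return relative frequencies of all len(alphabet)**k k-mers in seq.

    Windows containing a symbol outside the alphabet are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def encode_rna(seq):
    """4-mer frequency vector for an RNA sequence: 4**4 = 256 dimensions."""
    seq = seq.upper().replace("T", "U")  # tolerate DNA-style input
    return kmer_frequencies(seq, RNA_ALPHABET, 4)


def encode_protein(seq):
    """Conjoint triad vector for a protein sequence: 7**3 = 343 dimensions."""
    # Map each residue to its class label; unknown residues become "X"
    # and any triad containing them is skipped.
    reduced = "".join(AA_TO_CLASS.get(aa, "X") for aa in seq.upper())
    return kmer_frequencies(reduced, "0123456", 3)


if __name__ == "__main__":
    rna_vec = encode_rna("ACGUACGUACGGAUCC")
    protein_vec = encode_protein("MKTAYIAKQRQISFVK")
    print(len(rna_vec), len(protein_vec))
```

In the pipeline described in the abstract, vectors of these two sizes would be the inputs to the separate ncRNA and protein auto-encoder networks.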
| Date of Award | 2025 |
|---|---|
| Original language | American English |
| Awarding Institution | HBKU College of Science and Engineering |
A DEEP LEARNING MODEL TO PREDICT THE ncRNA-PROTEIN INTERACTIONS BASED ON SEQUENCES INFORMATION ONLY
Sewailem, M. (Author). 2025
Student thesis: Master's Dissertation