MACHINE LEARNING-BASED METHODS FOR THE IDENTIFICATION AND LOCALIZATION OF RNA

  • Saleh Musleh

Student thesis: Doctoral Dissertation

Abstract

This work challenges the conventional protein-centric view of genomes by highlighting the critical functions of both messenger RNA (mRNA) and long non-coding RNAs (lncRNAs). LncRNAs are essential regulators of cellular processes such as gene expression, chromatin remodeling, and cell differentiation, although they do not encode proteins. The catalog of lncRNAs has grown with the rapid development of next-generation sequencing technology, yet a comprehensive analysis of this expanding collection remains difficult. To address this, we built a machine learning-based tool called the Long Non-Coding RNA Identifier (LNCRI), which separates human lncRNA transcripts from human protein-coding mRNA transcripts with over 92% and 98% accuracy for mRNA and lncRNA, respectively. We relied on GENCODE as a comprehensive source of genomic annotation data. LNCRI outperforms current models for cross-species prediction (human, mouse, chimpanzee, monkey, gorilla, orangutan, cow, pig, frog, and zebrafish) by using features such as weighted k-mer, pseudo nucleotide composition, hexamer usage bias, Fickett score, open reading frame, UTR information, and HMMER score. The spatial and temporal expression of genes is determined by mRNA localization mechanisms, which also affect important processes including cell polarity, development, and synaptic plasticity. We propose two machine learning methods, the mRNA Subcellular Localization Predictor (MSLP) and the Unified mRNA Subcellular Localization Predictor (UMSLP), to predict mRNA subcellular localization with high accuracy. MSLP uses ensemble-based models, while UMSLP uses k-mer, pseudo-k-tuple nucleotide composition, and Z-curve transformation features, achieving over 87% precision, 94% specificity, and 94% accuracy across five subcellular regions.
Our tools (LNCRI, MSLP, and UMSLP), along with the accompanying GitHub and Docker API, will aid the scientific community in accurately identifying lncRNA and mRNA transcripts and predicting mRNA localization.
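As an illustration of three of the sequence features named in the abstract (k-mer composition, open reading frame, and Z-curve transformation), the following is a minimal sketch in Python. It is not the code of LNCRI, MSLP, or UMSLP. The function names, the unweighted k-mer definition, and the simple ATG-to-stop ORF rule are assumptions made for this example only.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def normalize(seq):
    """Uppercase the sequence and map RNA 'U' to DNA 'T' (assumed convention)."""
    return seq.upper().replace("U", "T")


def kmer_frequencies(seq, k=3):
    """Relative frequency of every possible k-mer over the alphabet ACGT.

    This is plain (unweighted) k-mer composition; any weighting scheme
    used in the thesis is not reproduced here.
    """
    seq = normalize(seq)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignores k-mers with ambiguous bases
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF on the forward strand."""
    seq = normalize(seq)
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def z_curve(seq):
    """Cumulative Z-curve coordinates (x, y, z) after each base.

    x: purine (A, G) minus pyrimidine (C, T)
    y: amino (A, C) minus keto (G, T)
    z: weak hydrogen bond (A, T) minus strong hydrogen bond (G, C)
    """
    x = y = z = 0
    coords = []
    for base in normalize(seq):
        if base not in "ACGT":
            continue  # skip ambiguous bases such as N
        x += 1 if base in "AG" else -1
        y += 1 if base in "AC" else -1
        z += 1 if base in "AT" else -1
        coords.append((x, y, z))
    return coords
```

In a pipeline of this kind, vectors such as these would typically be concatenated per transcript and passed to a classifier; the specific models and feature weights used by the tools are described in the thesis itself.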
Date of Award: 2024
Original language: American English
Awarding Institution
  • HBKU College of Science and Engineering

Keywords

  • Genetics
  • lncRNA
  • Machine Learning
  • Multiple Species
  • RNA
  • Subcellular Localization
