Abstract
In this study, we introduce a novel de Bruijn graph (dBG) based framework for feature engineering over biological sequential data such as proteins. The framework simplifies feature extraction by dynamically generating high-quality, interpretable features for traditional AI (TAI) algorithms. It accounts for amino acid substitutions by efficiently adjusting the edge weights in the dBG using a secondary trie structure. We extract motifs from the dBG by traversing its heavy edges, and then use alignment algorithms such as BLAST and Smith-Waterman to turn those motifs into features for TAI algorithms. Empirical validation on TIMP (tissue inhibitors of matrix metalloproteinase) data demonstrates significant accuracy improvements over a robust baseline, state-of-the-art protein language models (PLMs), and the popular GLAM2 tool. Furthermore, our framework successfully identified glycine- and arginine-rich motifs with high coverage, highlighting its potential for general pattern discovery.
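The core idea of weighting dBG edges and reading motifs off the heavy edges can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `build_dbg` and `heavy_path`, the `min_weight` threshold, and the greedy forward-only traversal are all assumptions made for illustration, and the trie-based substitution handling and the alignment step are omitted.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Build a weighted de Bruijn graph (hypothetical helper).

    Nodes are (k-1)-mers; each k-mer contributes one unit of weight
    to the edge from its prefix to its suffix.
    """
    weights = defaultdict(int)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            weights[(kmer[:-1], kmer[1:])] += 1
    return dict(weights)


def heavy_path(weights, min_weight):
    """Greedily follow heavy edges to spell out one candidate motif.

    Assumption: start at the heaviest edge and extend forward only,
    always taking the heaviest outgoing edge, stopping at a revisited
    node or when no edge reaches min_weight.
    """
    successors = defaultdict(list)
    for (u, v), w in weights.items():
        if w >= min_weight:
            successors[u].append((w, v))
    if not successors:
        return ""
    u, v = max(weights, key=weights.get)
    motif, visited, node = u + v[-1], {u, v}, v
    while successors.get(node):
        _, nxt = max(successors[node])
        if nxt in visited:
            break
        motif += nxt[-1]
        visited.add(nxt)
        node = nxt
    return motif


# Toy sequences sharing a glycine/arginine-rich core.
toy = ["AGRGGRGA", "TGRGGRGT", "CGRGGRGC"]
graph = build_dbg(toy, k=3)
print(heavy_path(graph, min_weight=2))
```

In this toy input the shared core produces the heaviest edges, so the traversal recovers a fragment of it while ignoring the flanking residues that differ between sequences.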
| Original language | English |
|---|---|
| Article number | 035020 |
| Journal | Machine Learning: Science and Technology |
| Volume | 5 |
| Issue number | 3 |
| Publication status | Published - 19 Jul 2024 |
Keywords
- Bioinformatics
- Machine learning
- de Bruijn graph
Fingerprint
Dive into the research topics of 'An extended de Bruijn graph for feature engineering over biological sequential data'. Together they form a unique fingerprint.