Active Prompt Caching in Edge Networks for Generative AI and LLMs: An RL-Based Approach

  • Emna Baccour
  • Aiman Erbad
  • Amr Mohamed
  • Mounir Hamdi
  • Mohsen Guizani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Generative AI (GAI) and Large Language Models (LLMs) have revolutionized natural language processing and content creation. However, their significant computational demands during inference often require cloud servers, which are currently the only viable option for handling complex multi-modal models like GPT-4. The inherent complexity of these models increases latency, posing challenges even within cloud environments. Cloud reliance also brings further challenges, including high bandwidth consumption for transferring diverse data types. Worse, in personalized GAI applications such as virtual assistants, similar prompts occur frequently, causing redundant transmission and computation of replies and further increasing overhead. Accelerating the inference of multi-modal systems is therefore critical in artificial intelligence. In this paper, we aim to improve inference efficiency through prompt caching: if a current prompt is semantically similar to a previous one, the system can reuse the earlier response without invoking the model again. We leverage collaborative edge computing to cache popular replies and store their request embeddings. New prompts are processed locally to extract embeddings, whose quality is determined by the resources available on edge servers. We formulate the problem as an optimization over offloading decisions for GAI tasks, aiming to avoid cloud inference and minimize latency while maximizing reply quality. Given its non-convex nature, we propose to solve it via Block Successive Upper Bound Minimization (BSUM). Reinforcement learning is employed to proactively pre-cache prompts, tackling the complexity of unknown prompt popularity. Our approach demonstrates near-optimal performance, significantly outperforming cloud-only solutions.
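To make the caching idea concrete, the following is a minimal sketch of the reuse-on-similarity mechanism the abstract describes, assuming cosine similarity over unit-normalized embeddings and a fixed threshold. The `embed` function, the `PromptCache` class, and the 0.9 cutoff are illustrative assumptions rather than the paper's actual design; with this placeholder encoder only identical prompts would match, whereas a real local embedding model (whose quality, per the abstract, depends on edge-server resources) is needed for genuinely semantic hits.

```python
import numpy as np

# Hypothetical stand-in encoder: deterministic pseudo-embedding per prompt.
# A real edge deployment would run a local embedding model whose quality
# depends on the server's available resources, as the abstract describes.
def embed(prompt: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

class PromptCache:
    """Toy semantic cache: reuse a stored reply when a new prompt's
    embedding is close enough (cosine similarity) to a cached one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # similarity cutoff (assumed, not from the paper)
        self.entries: list[tuple[np.ndarray, str]] = []

    def lookup(self, prompt: str) -> str | None:
        q = embed(prompt)
        for emb, reply in self.entries:
            # Embeddings are unit vectors, so the dot product is cosine similarity.
            if float(np.dot(q, emb)) >= self.threshold:
                return reply  # cache hit: skip model inference entirely
        return None  # cache miss: the prompt must be offloaded to the model

    def store(self, prompt: str, reply: str) -> None:
        self.entries.append((embed(prompt), reply))

# Usage: consult the edge cache first; invoke the model only on a miss.
cache = PromptCache()
prompt = "What's the weather like today?"
reply = cache.lookup(prompt)
if reply is None:
    reply = "(model-generated reply)"  # placeholder for cloud/edge inference
    cache.store(prompt, reply)
```

On a miss, the paper's full system additionally decides where to run inference (edge or cloud) via the BSUM-based offloading optimization, and uses reinforcement learning to pre-cache replies for prompts whose popularity is not known in advance; those components are beyond this sketch.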

Original language: English
Title of host publication: 2025 IEEE Wireless Communications and Networking Conference, WCNC
Editors: IEEE
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 7
ISBN (Electronic): 979-8-3503-6836-9
ISBN (Print): 979-8-3503-6837-6
DOIs
Publication status: Published - 27 Mar 2025
Event: 2025 IEEE Wireless Communications and Networking Conference, WCNC 2025 - Milan, Italy
Duration: 24 Mar 2025 – 27 Mar 2025

Publication series

Name: IEEE Wireless Communications and Networking Conference

Conference

Conference: 2025 IEEE Wireless Communications and Networking Conference, WCNC 2025
Country/Territory: Italy
City: Milan
Period: 24/03/25 – 27/03/25

Keywords

  • BSUM
  • Collaborative edge computing
  • Generative AI
  • LLM
  • Prompt caching
  • RL
