TY - GEN
T1 - Active Prompt Caching in Edge Networks for Generative AI and LLMs
T2 - 2025 IEEE Wireless Communications and Networking Conference, WCNC 2025
AU - Baccour, Emna
AU - Erbad, Aiman
AU - Mohamed, Amr
AU - Hamdi, Mounir
AU - Guizani, Mohsen
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/3/27
Y1 - 2025/3/27
N2 - Generative AI (GAI) and Large Language Models (LLMs) have revolutionized natural language processing and content creation. However, their significant computational demands during inference often require cloud servers, which are currently the only viable option for handling complex multi-modal models like GPT-4. The inherent complexity of these models increases latency, posing challenges even within cloud environments. Furthermore, cloud reliance brings other challenges, including high bandwidth consumption to transfer diverse data types. Worse, in personalized GAI applications like virtual assistants, similar prompts frequently occur, causing redundant transmission and computation of replies, which further increases overhead. Accelerating the inference of multi-modal systems is, therefore, critical in artificial intelligence. In this paper, we aim to improve the inference efficiency through prompt caching; if a current prompt is semantically similar to a previous one, the system can reuse the earlier response without invoking the model again. We leverage collaborative edge computing to cache popular replies and store their request embeddings. New prompts are locally processed to extract embeddings, with their qualities determined by the resources available on edge servers. Our problem is formulated as an optimization to manage offloading decisions for GAI tasks, aiming to avoid cloud inferences and minimize latency while maximizing reply quality. Given its non-convex nature, we propose to solve it via Block Successive Upper Bound Minimization (BSUM). Reinforcement learning is employed to actively pre-cache prompts, tackling the complexity of unknown prompt popularity. Our approach demonstrates near-optimal performance, significantly outperforming cloud-only solutions.
AB - Generative AI (GAI) and Large Language Models (LLMs) have revolutionized natural language processing and content creation. However, their significant computational demands during inference often require cloud servers, which are currently the only viable option for handling complex multi-modal models like GPT-4. The inherent complexity of these models increases latency, posing challenges even within cloud environments. Furthermore, cloud reliance brings other challenges, including high bandwidth consumption to transfer diverse data types. Worse, in personalized GAI applications like virtual assistants, similar prompts frequently occur, causing redundant transmission and computation of replies, which further increases overhead. Accelerating the inference of multi-modal systems is, therefore, critical in artificial intelligence. In this paper, we aim to improve the inference efficiency through prompt caching; if a current prompt is semantically similar to a previous one, the system can reuse the earlier response without invoking the model again. We leverage collaborative edge computing to cache popular replies and store their request embeddings. New prompts are locally processed to extract embeddings, with their qualities determined by the resources available on edge servers. Our problem is formulated as an optimization to manage offloading decisions for GAI tasks, aiming to avoid cloud inferences and minimize latency while maximizing reply quality. Given its non-convex nature, we propose to solve it via Block Successive Upper Bound Minimization (BSUM). Reinforcement learning is employed to actively pre-cache prompts, tackling the complexity of unknown prompt popularity. Our approach demonstrates near-optimal performance, significantly outperforming cloud-only solutions.
KW - BSUM
KW - Collaborative edge computing
KW - Generative AI
KW - LLM
KW - Prompt caching
KW - RL
UR - https://www.scopus.com/pages/publications/105006470020
U2 - 10.1109/WCNC61545.2025.10978306
DO - 10.1109/WCNC61545.2025.10978306
M3 - Conference contribution
AN - SCOPUS:105006470020
SN - 979-8-3503-6837-6
T3 - IEEE Wireless Communications and Networking Conference
BT - 2025 IEEE Wireless Communications and Networking Conference, WCNC
A2 - IEEE
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 March 2025 through 27 March 2025
ER -