Leveraging LLM Embeddings and Reverse Dictionaries for Reliable Topic Modeling and Privacy-Sensitive Smart City Applications: Toward Residents’ Satisfaction and Safety

  • Eiman Mohammed

Student thesis: Master's Dissertation

Abstract

This thesis aims to reliably identify the latent topics that concern smart city residents and to protect their privacy when they use Large Language Models (LLMs). In particular, it seeks to identify residents’ key perceptions of and reactions to global issues, including climate change; for example, it explores the dominant news discussed and shared by smart city residents. To that end, we propose a topic modeling approach that leverages the strength of model embeddings for semantic search and the reliability of reverse dictionaries to identify relevant topics. This novel approach is competitive with state-of-the-art approaches, including Latent Dirichlet Allocation (LDA) and BERTopic. With accurate topic identification, smart city decision-makers can prioritize and tailor relevant services to enhance residents’ satisfaction and quality of life.

The thesis also aims to protect the privacy of smart city residents. In the era of LLMs, residents are expected to interact with this technology frequently, for example by asking diverse questions. These questions are not always general; some are privacy-sensitive, which makes it critical to protect the prompting privacy of LLM end users. The need is even more pressing when residents do not fully trust the LLM’s service provider, a reasonable concern for privacy-conscious residents. Typical privacy protection techniques, including encryption, offer limited protection for prompts because typical LLM frameworks must decode prompts to plain text in order to process them and generate relevant responses. Providers would therefore know the exact prompts, exposing end users to privacy risks. To address this concern, we propose submitting the embeddings of the prompts instead of the plain-text prompts. Since embeddings are irreversible, LLM providers would reconstruct and use approximate (not exact) prompts, which mitigates the risk of the exact prompts being confidently identified and helps prevent potentially serious privacy consequences. For the prompt reconstruction process to balance prompting privacy against utility (preserving the semantic meaning of the prompt), we propose using the embeddings of reverse dictionaries, which can reliably project the embedding of a user prompt onto a semantically relevant item within the reverse dictionary. Through reliable topic modeling and mitigation of the privacy concerns associated with LLM usage, this thesis supports actionable insights and safeguards resident privacy in smart city applications.
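
The projection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis’s implementation: `toy_embed` is a hypothetical stand-in for a real LLM embedding model (a hashed bag of words), the three reverse-dictionary entries are invented for illustration, and cosine similarity is assumed as the matching criterion. Only the embedding of the prompt reaches the provider side, which maps it to the closest reverse-dictionary item.

```python
import hashlib

import numpy as np

DIM = 4096  # assumed toy dimensionality, not taken from the thesis


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an LLM embedding model.

    Hashes each word into one of DIM buckets and L2-normalizes the counts.
    """
    vec = np.zeros(DIM)
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Invented reverse-dictionary entries: definition -> headword.
REVERSE_DICTIONARY = {
    "long term change in global temperature and weather patterns": "climate change",
    "movement of vehicles and people along roads": "traffic",
    "services for treating illness and maintaining health": "healthcare",
}


def project(query_vec: np.ndarray, entries: dict = REVERSE_DICTIONARY):
    """Provider side: map a received embedding to the nearest dictionary item.

    Returns the headword and its cosine similarity to the query embedding.
    """
    definitions = list(entries)
    matrix = np.stack([toy_embed(d) for d in definitions])
    scores = matrix @ query_vec  # rows and query are unit-length
    best = int(np.argmax(scores))
    return entries[definitions[best]], float(scores[best])


if __name__ == "__main__":
    # User side: only the embedding leaves the device, not the prompt text.
    prompt_vec = toy_embed("why is global temperature rising")
    print(project(prompt_vec))
```

The same nearest-item lookup could, under the same assumptions, be applied to document embeddings to assign each document a topic label drawn from the reverse dictionary.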
Date of Award: 2025
Original language: American English
Awarding Institution
  • HBKU College of Science and Engineering
