FINDING OTHER SOURCES: USING MACHINE LEARNING AND THEMATIC ANALYSIS TO EXTRACT INFORMATION ABOUT DIABETES SELF-MANAGEMENT FROM OPEN-SOURCE DATA

  • Farahnaz Yousif

Student thesis: Master's Dissertation

Abstract

Background: The amount of health-related online content is steadily increasing due to the proliferation of m-health applications and their growing use. This content, which is publicly available on open-data platforms, may be used to support public health. Objective: This paper’s main aim is to explore whether topic modeling techniques can extract topics from user reviews of diabetes applications that are comparable to those found in a conventional diabetes self-management questionnaire. Method: Using Python (Google Play Scraper) and a set of predefined search terms ("Blood Sugar," "Diabetes," and "Glucose"), we extracted a total of 153.3k user reviews of diabetes apps, together with related metadata such as user ratings and timestamps. We performed sentiment analysis on our dataset and analyzed the collected reviews using single-word (unigram) and two-word (bigram) frequencies. To classify the topics discussed in the reviews, we used Latent Dirichlet Allocation (LDA) for topic modeling. Lastly, we mapped the user reviews to the identified topics using a string-matching technique and measured the interaction rate per topic. Results: Our analysis identified a total of 15 topics in the user reviews of diabetes applications, which were grouped into two main themes: "Self-Monitoring of Diabetes" and "Diabetes Self-management Technology." These topics were compared to the four categories found in the Diabetes Self-Management (DSM) questionnaire, with eight of the user-review topics correlating with the questionnaire's four categories. Conclusion: The dataset extracted from user reviews of diabetes apps was used to generate topics similar to those collected from the DSM questionnaire, along with some additional data. These findings suggest that analyzing open-source data to assess general trends and behaviors in chronically ill populations is a feasible approach that could allow for continuous monitoring at a reduced cost.
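As an illustration of two steps named in the Method (unigram/bigram frequencies and string-matching reviews to topics), the following is a minimal Python sketch using only the standard library. It is not the thesis's code: the sample reviews, the topic names, and the keyword lists are invented for illustration, and the "share of reviews per topic" computed at the end is a simple stand-in, since the abstract does not define how the interaction rate was measured.

```python
from collections import Counter
import re

# Hypothetical sample reviews, invented for illustration only
# (not taken from the thesis dataset).
reviews = [
    "Great app for tracking my blood sugar every morning",
    "Blood sugar log is easy but sync with my meter fails",
    "Love the carb counter and the blood sugar charts",
    "Bluetooth sync with the meter keeps failing",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def ngram_counts(docs, n):
    """Count n-grams (as tuples of tokens) across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

unigrams = ngram_counts(reviews, 1)
bigrams = ngram_counts(reviews, 2)

# Hypothetical topics and keyword phrases (assumed, not the thesis's 15 topics).
topic_keywords = {
    "glucose tracking": ["blood sugar", "glucose"],
    "device sync": ["sync", "bluetooth"],
}

def match_topics(text, topics):
    """Return the topics whose keyword phrases occur as substrings of the text."""
    lowered = text.lower()
    return [name for name, kws in topics.items()
            if any(kw in lowered for kw in kws)]

# Share of reviews matched to each topic (a review may match several topics).
matched = Counter()
for review in reviews:
    matched.update(match_topics(review, topic_keywords))
share = {topic: matched[topic] / len(reviews) for topic in topic_keywords}

print(bigrams.most_common(3))
print(share)
```

Plain substring matching is the simplest form of string matching; a real pipeline would likely add stemming or word-boundary checks to avoid false matches inside longer words.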
Date of Award: 2021
Original language: American English
Awarding Institution
  • HBKU College of Science and Engineering

Keywords

  • Applications users reviews
  • Big Data
  • Data Analysis
  • Diabetes
  • Digital Surveillance
  • Topic Modelling
