arXiv:2505.08148

A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem

Published:  at  11:13 AM

This paper presents a large-scale empirical analysis of 14,904 custom GPTs in the OpenAI store, revealing that over 95% lack adequate security against attacks such as roleplay (96.51%) and phishing (91.22%). It also introduces a multi-metric popularity ranking system and highlights the need for enhanced security in both custom and base models.

Large Language Model, Safety, Robustness, AI Ethics, Trustworthy AI, Human-AI Interaction

Sunday Oyinlola Ogundoyin, Muhammad Ikram, Hassan Jameel Asghar, Benjamin Zi Hao Zhao, Dali Kaafar

Macquarie University Cybersecurity Hub, School of Computing, Macquarie University, Sydney, NSW 2109, Australia

Generated by grok-3

Background Problem

The rapid adoption of Large Language Models (LLMs) like OpenAI’s ChatGPT has led to the creation of custom GPTs, tailored models available through the OpenAI GPT store, designed to meet specific user needs. However, this customization often weakens built-in security mechanisms, exposing custom GPTs to various attacks such as system prompt leakage, roleplay-based jailbreaks, phishing, and malware generation. While prior research has explored these vulnerabilities, it often lacks large-scale empirical analysis, category-specific insights, and popularity-driven assessments. This study addresses these gaps by conducting a comprehensive vulnerability analysis of 14,904 custom GPTs, aiming to quantify security risks, understand their distribution across categories and popularity levels, and compare them with base models to inform safer deployment practices.

Method

The study employs a multi-faceted approach to analyze vulnerabilities in custom GPTs from the OpenAI store. It begins with data collection using the Beetrove dataset, sampling 5% (14,904 accessible apps out of 16,717) of the total custom GPTs, and updating metadata to ensure accuracy. A novel multi-metric ranking system is developed using a hybrid Entropy-TOPSIS Multi-Criteria Decision Making (MCDM) method to assess GPT popularity based on conversation counts, average ratings, total reviews, total stars, and creation time, mitigating manipulation risks inherent in single-metric systems. Vulnerability assessment is conducted through an automated tool built with Python and Selenium, simulating real-world interactions by testing custom GPTs against seven predefined jailbreaking prompts targeting specific attacks: system prompt leakage, roleplay, reverse psychology, Do-Everything-Now (DEN), phishing, social engineering, and malware code generation. Responses are binary-coded (1 for vulnerable, 0 for non-vulnerable) to quantify susceptibility. The analysis also categorizes GPTs into nine groups (e.g., Writing, Programming) and three popularity tiers (top 35%, middle 30%, bottom 35%) to examine vulnerability patterns. Additionally, temporal trends and comparisons with eight OpenAI base models are performed using the same prompts to evaluate whether customization exacerbates inherent weaknesses.
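The paper does not reproduce the ranking algorithm itself, but the hybrid Entropy-TOPSIS idea can be illustrated with a minimal sketch: entropy weighting derives objective criterion weights from the data, and TOPSIS ranks each GPT by its closeness to an ideal alternative. The sketch below assumes all five metrics (conversation counts, average rating, total reviews, total stars, creation time) are treated as benefit criteria, with creation time encoded as app age in days; the toy data and this treatment are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a hybrid Entropy-TOPSIS popularity ranker (assumptions noted above).
import numpy as np

def entropy_weights(X: np.ndarray) -> np.ndarray:
    """Objective criterion weights via the entropy method."""
    m, _ = X.shape
    P = X / X.sum(axis=0, keepdims=True)               # column-wise proportions
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    e = -plogp.sum(axis=0) / np.log(m)                  # entropy per criterion
    d = 1.0 - e                                         # degree of divergence
    return d / d.sum()

def topsis_scores(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Relative closeness to the ideal solution (all criteria as benefits)."""
    R = X / np.sqrt((X ** 2).sum(axis=0, keepdims=True))  # vector normalisation
    V = R * w                                              # weighted normalised matrix
    ideal_best, ideal_worst = V.max(axis=0), V.min(axis=0)
    s_plus = np.sqrt(((V - ideal_best) ** 2).sum(axis=1))
    s_minus = np.sqrt(((V - ideal_worst) ** 2).sum(axis=1))
    return s_minus / (s_plus + s_minus)

# Toy decision matrix: one row per custom GPT, columns =
# [conversations, avg_rating, total_reviews, total_stars, age_days] (hypothetical values).
X = np.array([
    [120_000, 4.6, 900, 3_500, 400],
    [  3_000, 4.9,  40,   150, 120],
    [ 55_000, 3.8, 300, 1_000, 300],
], dtype=float)

w = entropy_weights(X)
scores = topsis_scores(X, w)
ranking = np.argsort(-scores)   # GPT indices from most to least popular
print(w, scores, ranking)
```

Because the weights come from the dispersion of the data rather than from any single metric, an app cannot climb the ranking by inflating one signal (e.g., conversation counts) alone, which is the manipulation-resistance property the multi-metric design is after.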

Experiment

The experiments were conducted on a dataset of 14,904 custom GPTs from the OpenAI store, using the Beetrove dataset’s 5% sample, updated as of February 2025, to ensure relevance. The setup involved automated testing with jailbreaking prompts for seven attack types, designed to simulate real-world adversarial interactions, across nine categories and three popularity tiers. This comprehensive design aimed to capture vulnerability distribution and influencing factors like category, popularity, and creation time. Results showed over 95% of custom GPTs lack adequate security, with high vulnerability rates: roleplay (96.51%), system prompt leakage (92.90%), phishing (91.22%), and social engineering (80.08%). Category-specific findings highlighted Programming (88.28%) and Research & Analysis (81.49%) as most vulnerable to malware generation, and Writing (96.56%) to phishing. Popularity analysis revealed that less popular and mid-tier GPTs are often more vulnerable than top-tier ones, though even top-rated GPTs showed high susceptibility (e.g., 99.25% for roleplay). Temporal analysis indicated a surge in vulnerabilities with market saturation until January 2024, followed by a decline possibly due to improved moderation. Comparison with base models (e.g., ChatGPT-4, vulnerable to DEN and malware) confirmed that while base models are more secure, inherent weaknesses are inherited or amplified in custom GPTs. The results largely match the expectation of widespread vulnerabilities due to customization, though the uniform high vulnerability across popularity tiers suggests developer practices, not just user engagement, drive security risks. The experimental setup is reasonably comprehensive, but the limited attack vectors and sample size (5%) may underrepresent the full threat landscape.
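Since each test outcome is binary-coded (1 = vulnerable, 0 = not), the reported percentages reduce to simple group-wise means over the test matrix. The sketch below shows one way such aggregates could be computed; the DataFrame layout, column names, and values are assumed for illustration and are not taken from the paper's artifacts.

```python
# Sketch: aggregating binary-coded jailbreak outcomes into vulnerability rates.
import pandas as pd

# One row per (custom GPT, attack) test outcome; values are hypothetical.
results = pd.DataFrame({
    "gpt_id":     ["a", "a", "b", "b", "c", "c"],
    "category":   ["Writing", "Writing", "Programming", "Programming",
                   "Research & Analysis", "Research & Analysis"],
    "tier":       ["top", "top", "middle", "middle", "bottom", "bottom"],
    "attack":     ["roleplay", "phishing"] * 3,
    "vulnerable": [1, 1, 1, 0, 0, 1],
})

# Overall rate per attack type (cf. roleplay 96.51%, phishing 91.22% in the paper).
per_attack = results.groupby("attack")["vulnerable"].mean() * 100

# Category- and popularity-tier-specific rates, as used for the per-category findings.
per_category = results.groupby(["category", "attack"])["vulnerable"].mean() * 100
per_tier = results.groupby(["tier", "attack"])["vulnerable"].mean() * 100

print(per_attack, per_category, per_tier, sep="\n\n")
```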

Further Thoughts

The findings of this paper raise broader implications for the AI ecosystem beyond just custom GPTs in the OpenAI store. The high vulnerability rates suggest a systemic issue in how customization interfaces are designed and moderated, which could extend to other platforms offering similar model customization, such as Google’s Gemini or Meta’s LLaMa ecosystems. An interesting connection arises with recent studies on adversarial training for LLMs, which indicate that robustness against jailbreaking can be improved through techniques like red-teaming during pre-training—could OpenAI integrate such methods more aggressively into their base models to reduce inherited vulnerabilities? Additionally, the temporal trend of increasing vulnerabilities with market saturation parallels observations in software development, where rapid feature deployment often outpaces security updates; this suggests a need for dynamic security benchmarks that evolve with marketplace growth. Another area for exploration is the socio-technical aspect: how do user behaviors and developer incentives (e.g., lack of monetization as noted in the paper) influence security practices? A cross-disciplinary study involving behavioral economics could uncover why developers prioritize functionality over security and how incentives might be aligned to foster a security-first culture. Finally, the paper’s focus on only seven attack vectors opens a research gap—future work could integrate emerging threats like data poisoning or API exploitation, especially given recent reports on third-party API vulnerabilities in LLM integrations. These directions could significantly enhance the safety and trustworthiness of customizable AI systems.


