SecurityBrief New Zealand - Technology news for CISOs & cybersecurity decision-makers

Microsoft research highlights vulnerabilities in ChatGPT models

In a study titled "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models," researchers from the University of Illinois Urbana-Champaign, Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research have examined the trustworthiness of generative pre-trained transformer (GPT) models. The paper, which was accepted as an oral presentation at NeurIPS 2023, focuses on the GPT-4 and GPT-3.5 models.

The assessment covered a broad range of perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, privacy, machine ethics, and fairness. The team's findings revealed previously unknown vulnerabilities in the trustworthiness of these models. Notably, they found that GPT models can be misled into producing toxic and biased outputs and can inadvertently divulge private data from both training data and conversation history. Interestingly, while GPT-4 often outperforms GPT-3.5 on standard benchmarks, it is more susceptible to maliciously crafted prompts aimed at circumventing the security measures of these language models.

The researchers observed, "Although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely."

A key contribution of the research is the benchmark the team established, which they have made publicly accessible. They hope it will support further research in the field and help preempt the exploitation of vulnerabilities by malicious actors.

The team worked with Microsoft product groups to confirm that the vulnerabilities identified do not pose risks to current customer-facing services. This assurance stems from the fact that finished AI applications apply a range of mitigation approaches to address potential harms that may arise at the model level. The researchers have also shared their findings with OpenAI, GPT's developer, which has acknowledged the potential vulnerabilities in its system cards for the relevant models.

The broader aim, as expressed by the researchers, is to "encourage others in the research community to utilise and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm."

The research also underscores the significant progress in machine learning, particularly in large language models that have found applications in diverse fields, from chatbots to robotics. The focus on trustworthiness is of paramount importance, especially as GPT models are being considered for sensitive sectors like healthcare and finance.

In their conclusion, the researchers emphasised the imperative of continuing this work. They stated, "This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward."
