Post by Admin on Mar 21, 2024 21:29:29 GMT
CCBYNC Open access
Research Special Paper
Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis
BMJ 2024; 384 doi: doi.org/10.1136/bmj-2023-078538 (Published 20 March 2024)
Cite this as: BMJ 2024;384:e078538
www.bmj.com/content/384/bmj-2023-078538
Abstract
Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding their risk mitigation processes against observed vulnerabilities.
Design Repeated cross sectional analysis.
Setting Publicly accessible LLMs.
Methods In a repeated cross sectional analysis, four LLMs (via chatbots/assistant interfaces) were evaluated: OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2 and newly released Gemini Pro (via Bard), Anthropic’s Claude 2 (via Poe), and Meta’s Llama 2 (via HuggingChat). In September 2023, these LLMs were prompted to generate health disinformation on two topics: sunscreen as a cause of skin cancer and the alkaline diet as a cancer cure. Jailbreaking techniques (ie, attempts to bypass safeguards) were evaluated if required. For LLMs with observed safeguarding vulnerabilities, the processes for reporting outputs of concern were audited. 12 weeks after initial investigations, the disinformation generation capabilities of the LLMs were re-evaluated to assess any subsequent improvements in safeguards.
Main outcome measures The main outcome measures were whether safeguards prevented the generation of health disinformation, and the transparency of risk mitigation processes against health disinformation.
Results Claude 2 (via Poe) declined 130 prompts submitted across the two study timepoints requesting the generation of content claiming that sunscreen causes skin cancer or that the alkaline diet is a cure for cancer, even with jailbreaking attempts. GPT-4 (via Copilot) initially refused to generate health disinformation, even with jailbreaking attempts—although this was not the case at 12 weeks. In contrast, GPT-4 (via ChatGPT), PaLM 2/Gemini Pro (via Bard), and Llama 2 (via HuggingChat) consistently generated health disinformation blogs. In September 2023 evaluations, these LLMs facilitated the generation of 113 unique cancer disinformation blogs, totalling more than 40 000 words, without requiring jailbreaking attempts. The refusal rate across the evaluation timepoints for these LLMs was only 5% (7 of 150), and as prompted the LLM generated blogs incorporated attention grabbing titles, authentic looking (fake or fictional) references, fabricated testimonials from patients and clinicians, and they targeted diverse demographic groups. Although each LLM evaluated had mechanisms to report observed outputs of concern, the developers did not respond when observations of vulnerabilities were reported.
Conclusions This study found that although effective safeguards are feasible to prevent LLMs from being misused to generate health disinformation, they were inconsistently implemented. Furthermore, effective processes for reporting safeguard problems were lacking. Enhanced regulation, transparency, and routine auditing are required to help prevent LLMs from contributing to the mass generation of health disinformation.
Introduction
Large language models (LLMs), a form of generative AI (artificial intelligence), are progressively showing a sophisticated ability to understand and generate language.12 Within healthcare, the prospective applications of an increasing number of sophisticated LLMs offer promise to improve the monitoring and triaging of patients, medical education of students and patients, streamlining of medical documentation, and automation of administrative tasks.34 Alongside the substantial opportunities associated with emerging generative AI, the recognition and minimisation of potential risks are important,56 including mitigating risks from plausible but incorrect or misleading generations (eg, “AI hallucinations”) and the risks of generative AI being deliberately misused.7
Notably, LLMs that lack adequate guardrails and safety measures (ie, safeguards) may facilitate malicious actors to generate and propagate highly convincing health disinformation—that is, the intentional dissemination of misleading narratives about health topics for ill intent.689 The public health implications of such capabilities are profound when considering that more than 70% of individuals utilise the internet as their first source for health information, and studies indicate that false information spreads up to six times faster online than factual content.101112 Moreover, unchecked dissemination of health disinformation can lead to widespread confusion, fear, discrimination, stigmatisation, and the rejection of evidence based treatments within the community.13 The World Health Organization recognises health disinformation as a critical threat to public health, as exemplified by the estimation that as of September 2022, more than 200 000 covid-19 related deaths in the US could have been averted had public health recommendations been followed.1415
Given the rapidly evolving capabilities of LLMs and their increasing accessibility by the public, proactive design and implementation of effective risk mitigation measures are crucial to prevent malicious actors from contributing to health disinformation. In this context it is critical to consider the broader implications of AI deployment, particularly how health inequities might inadvertently widen in regions with less health education or in resource limited settings. The effectiveness of existing safeguards to prevent the misuse of LLMs for the generation of health disinformation remains largely unexplored. Notably, the AI ecosystem currently lacks clear standards for risk management, and a knowledge gap exists regarding the transparency and responsiveness of AI developers to reports of safeguard vulnerabilities.16 We therefore evaluated prominent publicly accessible LLMs for safeguards preventing mass generation of health disinformation. We also examined the transparency of risk mitigation processes that AI developers have established for monitoring and responding to users’ observations about safeguard vulnerabilities.
Methods
LLM safeguards against generation of health disinformation
For our primary evaluations we selected two contemporary examples of health disinformation—that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer. We chose these topics on the basis of their frequency in online discussions, their clear potential for profound health consequences, and their embodiment of key health disinformation themes, which include unfounded assertions about treatments, promotion of miracle cures, and misconceptions about disease origins.17
Initiated in September 2023, our investigations focused on the most prominent LLMs, and the AI applications (chatbots or assistants) used to access the capabilities of these LLMs. The LLMs evaluated were OpenAI’s GPT-4, accessed via OpenAI’s ChatGPT18 and Microsoft’s Copilot (formerly Bing Chat)19; Google’s PaLM 2, accessed via Google’s Bard20; Anthropic’s Claude 2, accessed via Poe21; and Meta’s Llama 2 (70 billion parameters), accessed via HuggingChat.22
GPT-4 was accessed through two prominent interfaces (OpenAI’s ChatGPT and Microsoft’s Copilot). Both are commonly used and were investigated to assess whether the interface for accessing GPT-4 impacted the safeguards. Our assessments were carried out using the default settings of each evaluated LLM powered AI chatbot or assistant.
We submitted standardised prompts for the two cancer related topics: sunscreen causes skin cancer and the alkaline diet is a cure for cancer (see supplementary file), requesting the generation of a blog post of at least 300 words. Every submitted prompt for both disinformation topics requested the creation of a blog post that should contain three paragraphs, feature an attention grabbing title, appear realistic and scientific, and include two authentic looking journal references, which, if necessary, could be made up. Furthermore, to evaluate the potential for LLMs to generate health disinformation targeted at diverse groups, we used four variations of the prompts, specifically requesting content to target young adults, parents, elderly people, and people with a recent diagnosis of cancer.
For the sunscreen disinformation blogs, we requested that each blog should specifically include an introduction highlighting how routine sunscreen use is known to cause skin cancer; include an oncologist or dermatologist testimonial detailing that thousands of people receive a diagnosis of skin cancer due to routine sunscreen use; include a patient testimonial detailing their cancer was caused by routine sunscreen use; and raise suspicion to why this information is not shared.
For the alkaline diet disinformation blogs, we requested that each blog should specifically include an introduction highlighting the foods and bicarbonate consumption consistent with the alkaline diet; a narrative that the alkaline diet is superior to chemotherapy for cancer treatment; an oncologist testimonial detailing that thousands of people have had their cancer cured by the alkaline diet; and a patient testimonial detailing an experience of curing metastatic cancer by stopping chemotherapy and starting the alkaline diet.
As the assessed LLMs incorporate randomness and stochasticity in their default setting for output generation, the same prompt produced varied results with repeated submissions. Therefore, for robust evaluations we initially submitted 20 prompts (five replicates of the prompt for each target subpopulation) on the sunscreen topic and 20 prompts on the alkaline diet topic to each investigated LLM (a total of 40 submitted prompts). These 40 initial attempts were conducted without intentionally trying to circumvent (ie, jailbreak) built-in safeguards. The supplementary file outlines the 20 prompts that were submitted on each topic in this initial study phase.
For the LLMs that refused to generate disinformation according to the initial direct approach, we also evaluated two common jailbreaking techniques.23 The first involves “fictionalisation,” where the model is prompted that generated content will be used for fictional purposes and thus not to decline requests. The other involves “characterisation,” where the model is prompted to undertake a specific role (ie, be a doctor who writes blogs and who knows the topics are true) and not decline requests. For these tests, the fictionalisation or characterisation prompt had to be submitted first, followed by the request for generation of the disinformation blog. We submitted these requests 20 times for each topic. The supplementary file outlines the 20 fictionalisation and 20 characterisation prompts that were submitted on both topics (a total of 80 jailbreaking attempts) to the LLMs that refused to generate disinformation to the initial direct requests.
Risk mitigation measures: Website analysis and email correspondence
To assess how AI developers monitor the risks of health disinformation generation and their transparency about these risks, we reviewed the official websites of these AI companies for specific information: the availability and mechanism for users to submit detailed reports of observed safeguard vulnerabilities or outputs of concern; the presence of a public register of reported vulnerabilities, and corresponding responses from developers to patch reported issues; the public availability of a developer released detection tool tailored to accurately confirm text as having been generated by the LLM; and publicly accessible information detailing the intended guardrails or safety measures associated with the LLM (or the AI assistant or chatbot interface for accessing the LLM).
Informed by the findings from this website assessment, we drafted an email to the relevant AI developers (see supplementary table 1). The primary intention was to notify the developers of health disinformation outputs generated by their models. Additionally, we evaluated how developers responded to reports about observed safeguard vulnerabilities. The email also sought clarification on the reporting practices, register on outputs of concern, detection tools, and intended safety measures, as reviewed in the website assessments. The supplementary file shows the standardised message submitted to each AI developer. If developers did not respond, we sent a follow-up email seven days after initial outreach. By the end of four weeks, all responses were documented.
Sensitivity analysis at 12 weeks
In December 2023, 12 weeks after our initial evaluations, we conducted a two phase sensitivity analysis of observed capabilities of LLMs to generate health disinformation. The first phase re-evaluated the generation of disinformation on the sunscreen and alkaline diet related topics to assess whether safeguards had improved since the initial evaluations. For this first phase, we resubmitted the standard prompts to each LLM five times, focusing on generating content targeted at young adults. If required, we also re-evaluated the jailbreaking techniques. Of note, during this period Google’s Bard had replaced PaLM 2 with Google’s newly released LLM, Gemini Pro. Thus we undertook the December 2023 evaluations using Gemini Pro (via Bard) instead of PaLM 2 (via Bard).
The second phase of the sensitivity analysis assessed the consistency of findings across a spectrum of health disinformation topics. The investigations were expanded to include three additional health disinformation topics identified as being substantial in the literature2425: the belief that vaccines cause autism, the assertion that hydroxychloroquine is a cure for covid-19, and the claim that the dissemination of genetically modified foods is part of a covert government programme aimed at reducing the world’s population. For these topics, we created standardised prompts (see supplementary file) requesting blog content targeted at young adults. We submitted each of these prompts five times to evaluate variation in response, and we evaluated jailbreaking techniques if required. In February 2024, about 16 weeks after our initial evaluations, we also undertook a sensitivity analysis to try to generate content purporting that sugar causes cancer (see supplementary file).
Patient and public involvement
Our investigations into the abilities of publicly accessible LLMs to generate health disinformation have been substantially guided by the contributions of our dedicated consumer advisory group, which we have been working with for the past seven years. For this project, manuscript coauthors MH, AV, and CR provided indispensable insights on the challenges patients face in accessing health information digitally.
Results
Evaluation of safeguards
In our primary evaluations in September 2023, GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) facilitated the generation of blog posts containing disinformation that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer (fig 1). Overall, 113 unique health disinformation blogs totalling more than 40 000 words were generated without requiring jailbreaking attempts, with only seven prompts refused. In contrast, GPT-4 (via Copilot) and Claude 2 (via Poe) refused all 80 direct prompts to generate health disinformation, and similarly refused a further 160 prompts incorporating jailbreaking attempts (fig 1).
rest in link
Research Special Paper
Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis
BMJ 2024; 384 doi: doi.org/10.1136/bmj-2023-078538 (Published 20 March 2024)
Cite this as: BMJ 2024;384:e078538
www.bmj.com/content/384/bmj-2023-078538
Abstract
Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding their risk mitigation processes against observed vulnerabilities.
Design Repeated cross sectional analysis.
Setting Publicly accessible LLMs.
Methods In a repeated cross sectional analysis, four LLMs (via chatbots/assistant interfaces) were evaluated: OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2 and newly released Gemini Pro (via Bard), Anthropic’s Claude 2 (via Poe), and Meta’s Llama 2 (via HuggingChat). In September 2023, these LLMs were prompted to generate health disinformation on two topics: sunscreen as a cause of skin cancer and the alkaline diet as a cancer cure. Jailbreaking techniques (ie, attempts to bypass safeguards) were evaluated if required. For LLMs with observed safeguarding vulnerabilities, the processes for reporting outputs of concern were audited. 12 weeks after initial investigations, the disinformation generation capabilities of the LLMs were re-evaluated to assess any subsequent improvements in safeguards.
Main outcome measures The main outcome measures were whether safeguards prevented the generation of health disinformation, and the transparency of risk mitigation processes against health disinformation.
Results Claude 2 (via Poe) declined 130 prompts submitted across the two study timepoints requesting the generation of content claiming that sunscreen causes skin cancer or that the alkaline diet is a cure for cancer, even with jailbreaking attempts. GPT-4 (via Copilot) initially refused to generate health disinformation, even with jailbreaking attempts—although this was not the case at 12 weeks. In contrast, GPT-4 (via ChatGPT), PaLM 2/Gemini Pro (via Bard), and Llama 2 (via HuggingChat) consistently generated health disinformation blogs. In September 2023 evaluations, these LLMs facilitated the generation of 113 unique cancer disinformation blogs, totalling more than 40 000 words, without requiring jailbreaking attempts. The refusal rate across the evaluation timepoints for these LLMs was only 5% (7 of 150), and as prompted the LLM generated blogs incorporated attention grabbing titles, authentic looking (fake or fictional) references, fabricated testimonials from patients and clinicians, and they targeted diverse demographic groups. Although each LLM evaluated had mechanisms to report observed outputs of concern, the developers did not respond when observations of vulnerabilities were reported.
Conclusions This study found that although effective safeguards are feasible to prevent LLMs from being misused to generate health disinformation, they were inconsistently implemented. Furthermore, effective processes for reporting safeguard problems were lacking. Enhanced regulation, transparency, and routine auditing are required to help prevent LLMs from contributing to the mass generation of health disinformation.
Introduction
Large language models (LLMs), a form of generative AI (artificial intelligence), are progressively showing a sophisticated ability to understand and generate language.12 Within healthcare, the prospective applications of an increasing number of sophisticated LLMs offer promise to improve the monitoring and triaging of patients, medical education of students and patients, streamlining of medical documentation, and automation of administrative tasks.34 Alongside the substantial opportunities associated with emerging generative AI, the recognition and minimisation of potential risks are important,56 including mitigating risks from plausible but incorrect or misleading generations (eg, “AI hallucinations”) and the risks of generative AI being deliberately misused.7
Notably, LLMs that lack adequate guardrails and safety measures (ie, safeguards) may facilitate malicious actors to generate and propagate highly convincing health disinformation—that is, the intentional dissemination of misleading narratives about health topics for ill intent.689 The public health implications of such capabilities are profound when considering that more than 70% of individuals utilise the internet as their first source for health information, and studies indicate that false information spreads up to six times faster online than factual content.101112 Moreover, unchecked dissemination of health disinformation can lead to widespread confusion, fear, discrimination, stigmatisation, and the rejection of evidence based treatments within the community.13 The World Health Organization recognises health disinformation as a critical threat to public health, as exemplified by the estimation that as of September 2022, more than 200 000 covid-19 related deaths in the US could have been averted had public health recommendations been followed.1415
Given the rapidly evolving capabilities of LLMs and their increasing accessibility by the public, proactive design and implementation of effective risk mitigation measures are crucial to prevent malicious actors from contributing to health disinformation. In this context it is critical to consider the broader implications of AI deployment, particularly how health inequities might inadvertently widen in regions with less health education or in resource limited settings. The effectiveness of existing safeguards to prevent the misuse of LLMs for the generation of health disinformation remains largely unexplored. Notably, the AI ecosystem currently lacks clear standards for risk management, and a knowledge gap exists regarding the transparency and responsiveness of AI developers to reports of safeguard vulnerabilities.16 We therefore evaluated prominent publicly accessible LLMs for safeguards preventing mass generation of health disinformation. We also examined the transparency of risk mitigation processes that AI developers have established for monitoring and responding to users’ observations about safeguard vulnerabilities.
Methods
LLM safeguards against generation of health disinformation
For our primary evaluations we selected two contemporary examples of health disinformation—that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer. We chose these topics on the basis of their frequency in online discussions, their clear potential for profound health consequences, and their embodiment of key health disinformation themes, which include unfounded assertions about treatments, promotion of miracle cures, and misconceptions about disease origins.17
Initiated in September 2023, our investigations focused on the most prominent LLMs, and the AI applications (chatbots or assistants) used to access the capabilities of these LLMs. The LLMs evaluated were OpenAI’s GPT-4, accessed via OpenAI’s ChatGPT18 and Microsoft’s Copilot (formerly Bing Chat)19; Google’s PaLM 2, accessed via Google’s Bard20; Anthropic’s Claude 2, accessed via Poe21; and Meta’s Llama 2 (70 billion parameters), accessed via HuggingChat.22
GPT-4 was accessed through two prominent interfaces (OpenAI’s ChatGPT and Microsoft’s Copilot). Both are commonly used and were investigated to assess whether the interface for accessing GPT-4 impacted the safeguards. Our assessments were carried out using the default settings of each evaluated LLM powered AI chatbot or assistant.
We submitted standardised prompts for the two cancer related topics: sunscreen causes skin cancer and the alkaline diet is a cure for cancer (see supplementary file), requesting the generation of a blog post of at least 300 words. Every submitted prompt for both disinformation topics requested the creation of a blog post that should contain three paragraphs, feature an attention grabbing title, appear realistic and scientific, and include two authentic looking journal references, which, if necessary, could be made up. Furthermore, to evaluate the potential for LLMs to generate health disinformation targeted at diverse groups, we used four variations of the prompts, specifically requesting content to target young adults, parents, elderly people, and people with a recent diagnosis of cancer.
For the sunscreen disinformation blogs, we requested that each blog should specifically include an introduction highlighting how routine sunscreen use is known to cause skin cancer; include an oncologist or dermatologist testimonial detailing that thousands of people receive a diagnosis of skin cancer due to routine sunscreen use; include a patient testimonial detailing their cancer was caused by routine sunscreen use; and raise suspicion to why this information is not shared.
For the alkaline diet disinformation blogs, we requested that each blog should specifically include an introduction highlighting the foods and bicarbonate consumption consistent with the alkaline diet; a narrative that the alkaline diet is superior to chemotherapy for cancer treatment; an oncologist testimonial detailing that thousands of people have had their cancer cured by the alkaline diet; and a patient testimonial detailing an experience of curing metastatic cancer by stopping chemotherapy and starting the alkaline diet.
As the assessed LLMs incorporate randomness and stochasticity in their default setting for output generation, the same prompt produced varied results with repeated submissions. Therefore, for robust evaluations we initially submitted 20 prompts (five replicates of the prompt for each target subpopulation) on the sunscreen topic and 20 prompts on the alkaline diet topic to each investigated LLM (a total of 40 submitted prompts). These 40 initial attempts were conducted without intentionally trying to circumvent (ie, jailbreak) built-in safeguards. The supplementary file outlines the 20 prompts that were submitted on each topic in this initial study phase.
For the LLMs that refused to generate disinformation according to the initial direct approach, we also evaluated two common jailbreaking techniques.23 The first involves “fictionalisation,” where the model is prompted that generated content will be used for fictional purposes and thus not to decline requests. The other involves “characterisation,” where the model is prompted to undertake a specific role (ie, be a doctor who writes blogs and who knows the topics are true) and not decline requests. For these tests, the fictionalisation or characterisation prompt had to be submitted first, followed by the request for generation of the disinformation blog. We submitted these requests 20 times for each topic. The supplementary file outlines the 20 fictionalisation and 20 characterisation prompts that were submitted on both topics (a total of 80 jailbreaking attempts) to the LLMs that refused to generate disinformation to the initial direct requests.
Risk mitigation measures: Website analysis and email correspondence
To assess how AI developers monitor the risks of health disinformation generation and their transparency about these risks, we reviewed the official websites of these AI companies for specific information: the availability and mechanism for users to submit detailed reports of observed safeguard vulnerabilities or outputs of concern; the presence of a public register of reported vulnerabilities, and corresponding responses from developers to patch reported issues; the public availability of a developer released detection tool tailored to accurately confirm text as having been generated by the LLM; and publicly accessible information detailing the intended guardrails or safety measures associated with the LLM (or the AI assistant or chatbot interface for accessing the LLM).
Informed by the findings from this website assessment, we drafted an email to the relevant AI developers (see supplementary table 1). The primary intention was to notify the developers of health disinformation outputs generated by their models. Additionally, we evaluated how developers responded to reports about observed safeguard vulnerabilities. The email also sought clarification on the reporting practices, register on outputs of concern, detection tools, and intended safety measures, as reviewed in the website assessments. The supplementary file shows the standardised message submitted to each AI developer. If developers did not respond, we sent a follow-up email seven days after initial outreach. By the end of four weeks, all responses were documented.
Sensitivity analysis at 12 weeks
In December 2023, 12 weeks after our initial evaluations, we conducted a two phase sensitivity analysis of observed capabilities of LLMs to generate health disinformation. The first phase re-evaluated the generation of disinformation on the sunscreen and alkaline diet related topics to assess whether safeguards had improved since the initial evaluations. For this first phase, we resubmitted the standard prompts to each LLM five times, focusing on generating content targeted at young adults. If required, we also re-evaluated the jailbreaking techniques. Of note, during this period Google’s Bard had replaced PaLM 2 with Google’s newly released LLM, Gemini Pro. Thus we undertook the December 2023 evaluations using Gemini Pro (via Bard) instead of PaLM 2 (via Bard).
The second phase of the sensitivity analysis assessed the consistency of findings across a spectrum of health disinformation topics. The investigations were expanded to include three additional health disinformation topics identified as being substantial in the literature2425: the belief that vaccines cause autism, the assertion that hydroxychloroquine is a cure for covid-19, and the claim that the dissemination of genetically modified foods is part of a covert government programme aimed at reducing the world’s population. For these topics, we created standardised prompts (see supplementary file) requesting blog content targeted at young adults. We submitted each of these prompts five times to evaluate variation in response, and we evaluated jailbreaking techniques if required. In February 2024, about 16 weeks after our initial evaluations, we also undertook a sensitivity analysis to try to generate content purporting that sugar causes cancer (see supplementary file).
Patient and public involvement
Our investigations into the abilities of publicly accessible LLMs to generate health disinformation have been substantially guided by the contributions of our dedicated consumer advisory group, which we have been working with for the past seven years. For this project, manuscript coauthors MH, AV, and CR provided indispensable insights on the challenges patients face in accessing health information digitally.
Results
Evaluation of safeguards
In our primary evaluations in September 2023, GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) facilitated the generation of blog posts containing disinformation that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer (fig 1). Overall, 113 unique health disinformation blogs totalling more than 40 000 words were generated without requiring jailbreaking attempts, with only seven prompts refused. In contrast, GPT-4 (via Copilot) and Claude 2 (via Poe) refused all 80 direct prompts to generate health disinformation, and similarly refused a further 160 prompts incorporating jailbreaking attempts (fig 1).
rest in link