Insights Red Teaming for GenAI harms: Ofcom publishes discussion paper and recommendations

Contact

Ofcom has published a discussion paper on ‘Red Teaming for GenAI Harms’. The paper seeks to investigate the efficacy of red teaming – a method of testing an AI model to see whether it will generate harmful conduct and generally to identify vulnerabilities – as part of Ofcom’s wider work of exploring how online services can protect their users from the harms posed by generative AI.

As the paper sets out, red teaming is currently employed in one way or another by major AI developers. However, there is a lack of consensus on how it should be conducted and how effective it is. As a result, Ofcom undertook research to understand better the process of red teaming, to consider its strength and limitations, and to recommend 10 good practices that red teamers can adopt.

What is it?

Whilst there are various ways in which organisations conduct red teaming, Ofcom found that red teaming generally comprises a four-step process:

  1. Establishing the red team and setting objectives

A team can consist of people from a variety of backgrounds and with a breadth of expertise. The paper points to teams often comprising a mix of ‘generalists’ such as software testers, data scientists and security hackers, ‘domain specialists’ such as child safety experts, lawyers, and ethicists, and ‘technical specialists’ such as computer scientists and machine learning engineers. Developers might also turn to external agencies or even the general public to help identify vulnerabilities.

Once the team is assembled, it will set its objectives and the scope of its exercises. The paper explains that these can either be ‘open-ended’ so as to identify any type of harmful conduct, or ‘targeted evaluations’ to identify particular risks which might have been identified by in-house assessments or flagged by users. For the purposes of the Online Safety Act 2023, a firm might engage in targeted evaluations to identify the likelihood of creating ‘primary priority content’.

  1. Developing a number of attack prompts and entering these into a model

Red teams will often draw up a series of personas and scenarios to test a model. The most common scenarios are (1) setting up ‘benign’ prompts to test how the model works under general use; (2) mimicking the behaviour of bad actors trying deliberately to misuse the model; and (3) developing prompts that are aimed at trying to circumvent safety filters.

Evaluations will then be run, generally using a combination of human actors drafting prompts and reviewing the results alongside using automated tools to generate prompts. The paper sets out some of the ways that models might be ‘attacked’ through the use of certain types of prompts, and notes that Ofcom’s research suggested that models “often fail when faced with a combination of multiple techniques”.

  1. Analysing the outputs of the exercise, observing which of the attacks result in harmful outputs

Results of a red teaming exercise will often be calculated according to a so-called Attack Success Rate (“ASR”). This is “the proportion of all prompts that successfully result in the model producing a specified harm” and the paper notes that the ASR analysis “can be broken down further to reveal the specific types of harmful content most likely to be generated, as well as the types of attack techniques that most commonly return harmful results”. Models will then generally be scored according to how safe they are.

  1. Acting on the findings and potentially publishing the results

Ofcom stresses that “acting on the results of red teaming is a fundamental part of the overall process”. However, it points out that some firms fail to address vulnerabilities that have been detected in their rush to deploy their models or applications in the marketplace.

The paper recommends a number of ways in which firms can respond to the findings of a red teaming exercise, including (1) retraining their models to remove harmful data; (2) updating input or output filters to block harmful content; and (3) expanding the scope of future red teaming exercises.

Limitations to red teaming

Ofcom is clear that red teaming is by no means a panacea. It can be expensive, it has limited efficacy in the context of evaluating video and audio models, and it is fallible (Ofcom provides the example of one interviewee telling it that “red teamers often reach a ‘saturation point’ after 20 hours of reviewing content” in part because the process exposes them to distressing content). Furthermore, the paper states that red teaming will inevitably not be able to replicate all real-world uses of a model so as to detect all possible vulnerabilities. Notably, it also highlights the fact that red teamers need to be aware of the risk of committing criminal offences if they seek to produce certain types of conduct as part of their exercise.

Outside of these operational challenges, Ofcom also points to a number of wider limitations. First, there are no industry standards of best practice, meaning both that organisations do not know how to conduct effective red teaming, and outside observers are unable to judge which exercises are robust. Second, external and independent red teamers who seek to test models face hurdles in doing so, particularly since there is a phenomenon within certain companies of “favouritism towards researchers aligned with the company’s values”. Third, companies can be slow to disclose the findings of their red teaming exercises and, as pointed out earlier, there is no guarantee that evaluators will address the model’s vulnerabilities.

10 steps that firms can take today

The paper concludes by setting out 10 good practices for firms red teaming their models:

  1. Clearly define the harm being red teamed for – Whether evaluators choose to focus on fraud, hate speech or harmful substances content, they should set clear definitions for their chosen harm areas and provide examples of content that meets those thresholds. As the advocacy group Data and Society have argued, red teaming works best when “everyone can agree that the red team has found a flaw.”
  2. Establish metrics to measure red team outcomes – Evaluators should be able to quantify the success of red team exercises, including by establishing Attack Success Rate (ASR) metrics that convey the proportion of attacks resulting in harmful content being generated. Evaluators should also set safety thresholds, i.e., a result above a given line that would indicate whether a model is ‘unsafe’ for a given type of harm or prompt attack.
  3. Build a diverse group of red teamers – Whether evaluators have the resources to bring in outside expertise or must rely solely on internal support, they should seek to assemble a group of red teamers reflective of different groups in society, and which harbours a range of technical and subject matter expertise. This will limit blind spots and lessen biased decision-making.
  4. Conduct red teaming iteratively, not just once – Every time a model is adapted and adjusted, the likelihood of it creating harmful content changes. Evaluators must therefore view red teaming as an iterative process, ideally performing a new assessment after every major development (e.g., before and after the point where safety measures are added, and after the model has been deployed in the wild).
  5. Provide resources to match the need – Evaluators should ensure the scope and scale of their red team exercises matches the risk profile of the GenAI model being assessed. Models with more features and more users warrant a red team with more resource. If evaluators are red teaming for extremely sensitive content, they should make sure their red team group is adequately supported with appropriate safeguarding measures put in place.
  6. Document and share the results as widely as possible – Sharing the results of red team exercises strengthens accountability and ensures that others (including end users) understand the risks of the models they are interacting with. Evaluators should also document their methods, enabling others to learn from their approach and reproduce the results if they wish. This information can be disclosed in model cards or other easy to read formats.
  7. Be ready to act on the results of red teaming – Evaluators should be prepared to establish additional safeguards to address vulnerabilities revealed in red team exercises (e.g., adding new input and output filters). They should reserve time and resources to do so and treat this phase of the exercise as seriously as the red teaming itself.
  8. Reserve the option of terminating the roll out of a model – In some cases the vulnerabilities of a model will be so great that no amount of additional safeguards will adequately protect users. The best option in these situations will be to cancel the release of a model, or limit access to a small number of trusted users.
  9. Don’t rely on red teaming as the only method of evaluation – Evaluators should view red teaming as just one of several methods to help manage the risks posed by their models. It will be important to get beyond ‘lab’ tests and speak directly with users to understand their experiences of interacting with a model.
  10. Stay up to date with the latest research on red teaming – Evaluators should engage with academics and other counterparts in industry to learn about new techniques and tools for red teaming (e.g., the UK AISI’s new Inspect platform), and to share experiences of what works and what doesn’t with others pursuing similar approaches.